0% found this document useful (0 votes)
8 views17 pages

2307.01197v1

k

Uploaded by

Anand Dnyan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views17 pages

2307.01197v1

k

Uploaded by

Anand Dnyan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Segment Anything Meets Point Tracking

Frano Rajič1,3 Lei Ke1,2 Yu-Wing Tai2 Chi-Keung Tang2 Martin Danelljan1 Fisher Yu1
1 2 3
ETH Zürich HKUST EPFL
arXiv:2307.01197v1 [cs.CV] 3 Jul 2023

Legend:
Point
SAM positive points
Tracker negative points
occluded points
Input: Video and Query Points Predicted Trajectories and Occlusion Output: Predicted Masks

Figure 1. Segment Anything Meets Point Tracking (SAM-PT). SAM-PT is the first method to utilize sparse point propagation for Video
Object Segmentation (VOS). The essence of SAM-PT is to extend SAM [16] with long-term point trackers to effectively operate on videos
in a zero-shot manner. SAM-PT takes a video as input together with annotations of the target object in the first frame. These annotations
are called “query points” and denote either the target object (positive points) or designate non-target segments (negative points). The points
are tracked throughout the video using point trackers that propagate the query points to all video frames, producing predicted trajectories
and occlusion scores. SAM is subsequently prompted with the non-occluded points in the trajectories as to output a segmentation mask for
each video frame independently.

Abstract Medoids clustering for point initialization and track both


positive and negative points to clearly distinguish the target
object. We also employ multiple mask decoding passes for
The Segment Anything Model (SAM) has established it- mask refinement and devise a point re-initialization strategy
self as a powerful zero-shot image segmentation model, to improve tracking accuracy. Our code integrates different
employing interactive prompts such as points to gener- point trackers and video segmentation benchmarks and will
ate masks. This paper presents SAM-PT, a method ex- be released at https://github.com/SysCV/sam-
tending SAM’s capability to tracking and segmenting any- pt.
thing in dynamic videos. SAM-PT leverages robust and
sparse point selection and propagation techniques for mask
generation, demonstrating that a SAM-based segmenta- 1. Introduction
tion tracker can yield strong zero-shot performance across
popular video object segmentation benchmarks, including Video segmentation benefits a myriad of applications,
DAVIS, YouTube-VOS, and MOSE. Compared to traditional including autonomous driving, robotics, and video editing.
object-centric mask propagation strategies, we uniquely use Despite significant progress made in the past few years with
point propagation to exploit local structure information that deep neural networks [3, 4, 37, 41], the current methodolo-
is agnostic to object semantics. We highlight the merits of gies falter when faced with unseen data, particularly in zero-
point-based tracking through direct evaluation on the zero- shot settings. These models struggle to maintain consistent
shot open-world Unidentified Video Objects (UVO) bench- performance across diverse scenarios without specific video
mark. To further enhance our approach, we utilize K- segmentation data for fine-tuning.

1
The prevailing methods [3, 4] in semi-supervised Video predicted using state-of-the-art point trackers, such as
Object Segmentation (VOS) and Video Instance Segmen- PIPS [11], harnessing their versatility for video segmen-
tation (VIS) exhibit performance gaps when dealing with tation. We identify that initializing points to track using
unseen data, particularly in a zero-shot setting, i.e., when K-Medoids cluster centers from a mask label was the strat-
these models are transferred to video domains they have not egy most compatible with prompting SAM. Tracking both
been trained on and encompass object categories that fall positive and negative points enables the clear delineation of
outside of the training distribution. target objects from their background. To further refine the
A potential route towards overcoming these challenges output masks, we propose multiple mask decoding passes
lies in adapting successful models in the image seg- that integrate both types of points. In addition, we devised
mentation domain for video segmentation tasks. One a point re-initialization strategy that increases tracking ac-
such promising model is the Segment Anything Model curacy over time. This approach involves discarding points
(SAM) [16]. SAM is a powerful foundation model for im- that have become unreliable or occluded, and adding points
age segmentation, trained on the large-scale SA-1B dataset, from object parts or segments that become visible in later
which contains an astounding 11 million images and over frames, such as when the object rotates.
1 billion masks. This extensive training set enables SAM’s Notably, our experimental results highlight that SAM-
impressive zero-shot generalization capabilities. The model PT competes with existing zero-shot methods [36] or out-
is highly adaptable, able to produce high-quality masks performs them [2, 12, 35, 45] on several video segmentation
from single foreground points, and has demonstrated ro- benchmarks. This comes without the need for any video
bust performance across a range of downstream tasks un- segmentation data during training, underscoring the robust-
der zero-shot transfer protocols. While SAM demonstrates ness and adaptability of our approach. SAM-PT holds the
powerful zero-shot capabilities for image segmentation, it potential to enhance progress in video segmentation tasks,
is not innately suited for video segmentation tasks. particularly in zero-shot scenarios.
Recent efforts have been made to adapt SAM for
video segmentation. For instance, TAM [40] integrates
2. Related Work
SAM with the state-of-the-art memory-based mask tracker Point Tracking for Video Segmentation. Classical fea-
XMem [4]. Likewise, SAM-Track [6] combines SAM with ture extraction and tracking methods such as Lucas-
DeAOT [41]. While these methods mostly recover the per- Kanade [23], Tomasi-Kanade [31], Shi-Tomasi [29],
formance on in-distribution data, they fall short in preserv- SIFT [22], and SURF [1], as well as newer methods
ing the original performance of SAM in more challeng- such as LIFT [43], SuperPoint [7], and SuperGlue [28],
ing, zero-shot settings. Other methods that do not leverage have all demonstrated proficiency in identifying or track-
SAM, such as SegGPT [36], can successfully solve a num- ing sparse features and establishing long-range correspon-
ber of segmentation problems using visual prompting, but dences. However, their effectiveness is confined to a spe-
still require mask annotation for the first video frame. This cific set of distinct interest points and they often struggle
problem represents a significant barrier in zero-shot video when applied to non-rigid, dynamic scenes. Flow-based
segmentation, particularly as we seek to develop methods methods, such as RAFT [30], excel in tracking dense points
that can easily generalize to unseen scenarios and con- between successive frames. However, they stumble with de-
sistently deliver high-quality segmentation across diverse riving accurate long-range point trajectories. When chain-
video domains. ing flow predictions over time, errors tend to accumulate
We introduce SAM-PT (Segment Anything Meets Point and lead to drift, while occlusions result in tracking failures.
Tracking), depicted in Fig. 1. This is the first method to uti- Significant strides have recently been made in long-term
lize sparse point tracking combined with SAM for video point tracking across video frames, as evinced by methods
segmentation, offering a new perspective on solving the such as TapNet [9] and PIPS [11], as well as the concurrent
problem. Instead of employing object-centric dense feature and state-of-the-art OmniMotion [32] and TAPIR [10] tech-
matching or mask propagation, we propose a point-driven niques. These approaches optimize long-range point trajec-
approach that capitalizes on tracking points using rich local tories across an entire video, navigating mostly well through
structure information embedded in videos. As a result, it periods of occlusion.
only requires sparse points annotation to denote target ob- Our work stands apart as the first to integrate these suc-
ject in the first frame and provides better generalization to cessful long-term point tracking methods, utilizing them to
unseen objects, a strength demonstrated on the open-world guide a promptable foundation model for image segmenta-
UVO [34] benchmark. This approach also helps preserve tion toward performing video segmentation tasks.
the inherent flexibility of SAM while extending its capabil-
ities effectively to video segmentation. Segment and Track Anything models. SAM [16] is an
SAM-PT prompts SAM with sparse point trajectories innovative image segmentation model for promptable im-

2
age segmentation, trained on over 1 billion segmentation PIPS [11] and TapNet [9], with the powerful image seg-
masks. It showcases remarkable zero-shot generalization mentation of SAM to enable tracking of anything in videos.
abilities and can produce high-quality masks from a sin- First, Sec. 3.1 briefly describes SAM. Sec. 3.2 then intro-
gle foreground point. To further improve the quality of the duces our SAM-PT method with its four constituent steps.
masks, especially when segmenting objects with intricate Finally, Sec. 3.3 analyzes and highlights the method’s nov-
structures, HQ-SAM [15] extends SAM with a learnable elty as the first point-driven video segmentation method
high-quality output token which proves efficient in diverse compared to existing works.
segmentation domains. However, SAM and HQ-SAM can-
not be directly used to solve video segmentation tasks. 3.1. Preliminaries: SAM
A few concurrent works extend SAM, for example, The Segment Anything Model (SAM) [16] is a novel
TAM [40] and SAM-Track [6] combine SAM with state-of- vision foundation model designed for promptable image
the-art mask trackers (such as XMem [4] and DeAOT [42]) segmentation. SAM is trained on the large-scale SA-1B
to perform interactive video object segmentation. These dataset, which contains 11 million images and over 1 billion
methods employ SAM for mask initialization or correc- masks. SA-1B has 400 times more masks than any exist-
tion and XMem/DeAOT for mask tracking and predic- ing segmentation dataset. This extensive training set facil-
tion. Using the pre-trained mask trackers recovers the in- itates SAM’s impressive zero-shot generalization capabili-
distribution performance, but hinders the performance in ties to new data. SAM has showcased its ability to produce
zero-shot settings. PerSAM [45] also demonstrates the abil- high-quality masks from a single foreground point, and has
ity to track multiple reference objects in a video. Instead demonstrated robust generalization capacity on a variety of
of building an interactive tracking pipeline or SAM fine- downstream tasks under a zero-shot transfer protocol using
tuning, we focus on learning robust associations for diverse prompt engineering. These tasks include, but are not limited
objects in zero-shot scenarios. to, edge detection, object proposal generation, and instance
segmentation.
Zero-shot VOS / VIS. Among the non-SAM-based meth- SAM comprises of three main components: an image
ods, Painter [35] and its SegGPT [36] extension are an- encoder, a flexible prompt encoder, and a fast mask de-
other sort of generalist models for solving a variety of image coder. The image encoder is a Vision Transformer (ViT)
and segmentation tasks. These methods likewise use visual backbone and processes high-resolution 1024 × 1024 im-
prompting techniques but are inherently different frame- ages to generate an image embedding of 64 × 64 spatial
works from SAM. Despite its wide applicability, Painter size. The prompt encoder takes sparse prompts as input, in-
shows lacking performance in video segmentation tasks. cluding points, boxes, and text, or dense prompts such as
Conversely, SegGPT successfully uses in-context prompt- masks, and translates these prompts into c-dimensional to-
ing to achieve one-shot video object segmentation perfor- kens. The lightweight mask decoder then integrates the im-
mance comparable to ours, also without training on any age and prompt embeddings to predict segmentation masks
video data. The training domains, however, notably differ in real-time, allowing SAM to adapt to diverse prompts with
between SegGPT and our method. minimal computational overhead.
STC [12] and DINO [2] also do not use any video seg-
mentation data during training. In the semi-supervised 3.2. Ours: SAM-PT
video object segmentation, they take a reference mask as While SAM shows impressive capabilities in image seg-
input and perform frame-by-frame feature matching, which mentation, it is inherently limited in handling video seg-
propagates the reference mask across the entirety of the mentation tasks. Our Segment Anything Meets Point Track-
video. Our SAM-PT, on the other hand, diverges substan- ing (SAM-PT) approach effectively extends SAM to videos,
tially from these methodologies by adopting point tracking, offering robust video segmentation without requiring train-
eschewing the process of frame-by-frame feature match- ing on any video segmentation data.
ing. Additionally, our method requires only sparse points SAM-PT is illustrated in Fig. 2 and is primarily com-
to represent the target object, rather than a full reference posed of four steps: 1) selecting query points for the first
mask, and yields superior performance on conventional frame; 2) propagating these points to all video frames using
semi-supervised video object segmentation benchmarks. point trackers; 3) using SAM to generate per-frame segmen-
tation masks based on the propagated points; 4) optionally
3. Method reinitializing the process by sampling query points from the
We propose SAM-PT to adapt SAM, a foundation model predicted masks. We next elaborate on these four steps.
for image segmentation, for addressing video segmenta-
tion tasks in a zero-shot setting. SAM-PT combines the 1) Query Points Selection. The process begins with
strengths of existing prominent point trackers, such as defining query points in the first video frame, which either

3
Step 1)
Query Points
Selection

Point types: t=i t=i+1 t=i+2 t=i+3 t=i+4


positive
Step 2) Step 2)
negative
occluded Point Tracker Point Tracker
Query
Step 4) Points
Reinitialization

Step 3) Step 3) Step 3) Step 3) Step 3)


SAM SAM SAM SAM SAM

Figure 2. Segment Anything Meets Point Tracking (SAM-PT) overview. The essence of SAM-PT is to extend image segmentation
foundation model to effectively operate on videos. SAM-PT has four steps: 1) Query Points Selection. It starts with first-frame query
points which denote the target object (positive points) or designate non-target segments (negative points). These points are provided by
the user or derived from a ground truth mask. 2) Point Tracking. Initiated with the query points, our approach leverages point trackers
to propagate the points across video frames, predicting point trajectories and occlusion scores. 3) Segmentation. The trajectories are
then used to prompt the Segment Anything Model (SAM) and output per-frame mask predictions. 4) Point Tracking Reinitialization.
Optionally, the predicted masks are used to reinitialize the query points and restart the process when reaching a prediction horizon h.
Re-initialization helps by getting rid of unreliable and occluded points, and adds points from object parts or segments that become visible
in later frames, such as when the object rotates.

denote the target object (positive points) or designate the object. Shi-Tomasi sampling follows closely, indicating
background and non-target objects (negative points). Users their respective strengths in this context. The selection and
can manually and interactively provide query points, or they arrangement of these points considerably affect the overall
may be derived from a ground truth mask. For example, in video segmentation performance, thus determining the op-
the case of semi-supervised video object segmentation, the timal method is crucial.
ground truth mask is provided for the first frame where the
object appears. We derive the query points from ground 2) Point Tracking. Initiated with the query points, we
truth masks using different point sampling techniques by employ robust point trackers to propagate the points across
considering their geometrical locations or feature dissimi- all frames in the video, resulting in point trajectories and oc-
larities, as depicted in Fig. 3. These sampling techniques clusion scores. We adopt the state-of-the-art point tracker
are: PIPS [11] to propagate the points as PIPS shows moder-
• Random Sampling: An intuitive approach where ate robustness toward long-term tracking challenges such
query points are randomly selected from the ground as object occlusion and re-appearance. This is also shown
truth mask. more effective than methods such as chained optical flow
• K-Medoids Sampling: This technique takes the clus- propagation or first-frame correspondences in our experi-
ter centers of K-Medoids clustering [26] as query ment section.
points to ensure good coverage of different parts of the
object and robustness to noise and outliers. 3) Segmentation. In the predicted trajectories, the non-
• Shi-Tomasi Sampling: This method extracts Shi- occluded points serve as indicators of where the target ob-
Tomasi corner points from the image under the mask as ject is throughout the video. This allows us to use the non-
they have been shown to be good features to track [29]. occluded points to prompt SAM, as illustrated in Fig. 4,
• Mixed Sampling: A hybrid method combining the and leverage its inherent generalization ability to output
above techniques since it might benefit from the unique per-frame segmentation mask predictions. Unlike conven-
strengths of each. tional tracking methods that require training or fine-tuning
While each method contributes distinct characteristics on video segmentation data, our approach excels in zero-
that influence the model’s performance, our ablation study shot video segmentation tasks.
reveals that K-Medoids sampling yields the best results with We combine positive and negative points by calling SAM
a good full coverage of various segments of the complete in two passes. In the initial pass, we prompt SAM exclu-

4
4) Point Tracking Reinitialization. We optionally exe-
cute a reinitialization of the query points using the predicted
masks once a prediction horizon of h = 8 frames is reached,
and denote the variant as SAM-PT-reinit. Upon reaching
this horizon, we have h predicted masks and will take the
last predicted mask to sample new points. At this stage,
RGB Mask Random all previous points are discarded and substituted with the
newly sampled points. Following this, steps 1) through 4)
are repeated with the new points, starting from the horizon
timestep where reinitialization occurs. The steps are itera-
tively executed until the entire video is processed. The reini-
tialization process serves to enhance tracking accuracy over
time by discarding points that have become unreliable or
K-Medoids Shi-Tomasi Mixed occluded, while incorporating points from object segments
that become visible later in the video. Other reinitialization
Figure 3. Positive Point Sampling. For an image paired with variants are discussed in Appendix A and included in the
either a ground truth or predicted segmentation mask, positive ablation study in Sec. 4.3.
points are sampled from within the mask area using one of the
following point sampling methods: Random, K-Medoids [26], 3.3. SAM-PT vs. Object-centric Mask Propagation
Shi-Tomasi [29], or Mixed. Notably, Random Sampling and K-
Medoids Sampling only require the segmentation mask for input, With sparse point tracking combined with prompting
not the corresponding input image. For negative points, we always SAM, SAM-PT distinguishes itself from traditional video
use Mixed Sampling on the target object’s background mask. segmentation methods that depend on dense object mask
propagation, as noted in Tab. 1. To propagate the first-
frame GT label to the remaining video frames, traditional
Image Mask Mask techniques commonly use feature matching with masks
Encoder Decoder Decoder cached to a mask memory [4,6,40,41], frame-by-frame fea-
ture matching [2, 12], feature matching with the first-frame
mask [45], optical flow [39], and, recently, in-context vi-
Prompt Step 3) Prompt
Encoder Encoder sual prompting [35, 36]. In contrast, SAM-PT introduces a
SAM
unique approach to video object segmentation, employing
the robust combination of point tracking with SAM, which
Figure 4. Interacting with SAM in SAM-PT. In the first pass,
is inherently designed to operate on sparse point prompts.
SAM is prompted exclusively with positive points to define the
object’s initial localization. In the second pass, both positive and The point propagation strategy of SAM-PT offers several
negative points along with the previous mask prediction are fed to advantages over traditional object-centric tracking meth-
the same mask decoder for further mask refinement. The negative ods. First, point propagation exploits local structure con-
points remove segments from the background and neighboring ob- text that is agnostic to global object semantics. This en-
jects and notably help in cases when the point tracker mistakenly hances our model’s capability for zero-shot generalization,
predicts positive points off the target object. The second pass is an advantage that, coupled with SAM’s inherent general-
repeated iteratively to get a refined segmentation mask. ization power, allows for tracking diverse objects in diverse
environments, such as on the UVO benchmark. Second,
SAM-PT allows for a more compact object representation
with sparse points, capturing enough information to char-
sively with positive points to define the object’s initial lo- acterize the object’s segments/parts effectively. Finally, the
calization. Subsequently, in the second pass, we prompt use of points is naturally compatible with SAM, an image
SAM with both positive and negative points along with the segmentation foundation model trained to operate on sparse
previous mask prediction. Negative points provide a more point prompts, offering an integrated solution that aligns
nuanced distinction between the object and the background well with the intrinsic capacities of the underlying model.
and help by removing wrongly segmented areas.
Comparing SAM-PT with conventional methods in
Lastly, we execute a variable number of mask refine- Tab. 1, SAM-PT emerges as superior or comparable to
ment iterations by repeating the second pass. This utilizes methods that refrain from utilizing video segmentation data
SAM’s capacity to refine vague masks into more precise during training. However, there is a performance gap that
ones. Based on our ablation study, this step notably im- exists between such methods and those that leverage video
proves video object segmentation performance. segmentation training data in the same domain, such as

5
Table 1. Comparison of semi-supervised Video Object Seg- stead of a segmentation mask, or a set of points as is the
mentation (VOS) methods with respect to mask annotation re- case for our method. We evaluate our method on four VOS
quirements and propagation techniques. Our method, SAM- datasets: DAVIS 2016, DAVIS 2017 [27], YouTube-VOS
PT, is the first method for VOS that uses sparse point propaga-
2018 [38], and MOSE 2023 [8].
tion. With such compact mask representation, we achieve the high-
est J &F scores on the DAVIS 2016 and 2017 validation subsets
among methods that do not utilize any video segmentation data DAVIS 2016 [27]. DAVIS 2016 is a single-object VOS
during training. The methods are compared based on their use of benchmark, consisting of 20 highly diverse video se-
video mask data during training, whether they are evaluated zero- quences, each of which possesses well-annotated segmen-
shot on DAVIS, what first-frame labels they require, and what label tation masks.
propagation technique they employ.
DAVIS 2017 [27]. A multi-object extension of its 2016
Video Zero- Frame DAVIS DAVIS
Method Propagation version, DAVIS 2017 includes 60 videos in the training set
Mask Shot Init. 2016 2017
SiamMask [33] ✓ ✗ Box Feature Correlation 69.8 56.4 and 30 videos in the validation set, comprising a total of 197
QMRA [19] ✓ ✗ Box Feature Correlation 85.9 71.9
TAM [40] ✓ ✗ Points Feature Matching 88.4 - different objects. The video scenarios within this dataset are
SAM-Track [6] ✓ ✗ Points Feature Matching 92.0 - small but diverse.
XMem [4] ✓ ✗ Mask Feature Matching 92.0 87.7
DeAOT [41] ✓ ✗ Mask Feature Matching 92.9 86.2
Painter [35]
STC [12]




Mask
Mask
Mask Prompting
Feature Matching
-
-
34.6
67.6
YouTube-VOS 2018 [38]. YouTube-VOS 2018 is a
DINO [2] ✗ ✓ Mask Feature Matching - 71.4 large-scale dataset collected from YouTube, comprising
PerSAM-F [45] ✗ ✓ Mask Feature Matching - 71.9
SegGPT [36] ✗ ✓ Mask Mask Prompting 82.3 75.6 3471 training videos encompassing 65 categories and 474
SAM-PT (ours) ✗ ✓ Points Points Prompting 83.1 76.6 validation videos with an additional 26 unseen categories.
The diversity in categories and the inclusion of seen and
unseen classes allow for a comprehensive evaluation of a
XMem [4] or DeAOT [41]. Further, the potential of our given model’s generalization capability.
model extends beyond video object segmentation to other
tasks, such as Video Instance Segmentation (VIS), thanks MOSE 2023 [8]. MOSE 2023 is a recently introduced
to the inherent flexibility of our point propagation strategy. dataset designed for multiple object segmentation and track-
In summary, SAM-PT is the first method that introduces ing in complex scenes. This dataset is replete with chal-
sparse point propagation combined with prompting a im- lenges such as the transient visibility of objects, the pres-
age segmentation foundation model to perform zero-shot ence of minute or less noticeable entities, extensive occlu-
video object segmentation. It provides a fresh perspective sions, and scenes with a high object density. By design,
and adds a new dimension to the study of video object seg- each video in this dataset must contain multiple objects so
mentation. that occlusions must be present, and objects must show suf-
ficient motion, as opposed to being stationary or showing
4. Experiments little movement.
4.1. Datasets
Metrics. We report the standard evaluation metrics for
In the following subsections, we present an overview of video object segmentation [8, 27, 38], including region sim-
the datasets used in our study. Section 4.1.1 provides a brief ilarity J , contour accuracy F, and their average, J &F.
introduction to the Video Object Segmentation task and out-
lines the specific datasets we utilize for this task. Similarly, 4.1.2 Video Instance Segmentation
Section 4.1.2 discusses the Video Instance Segmentation
task and the dataset associated with it. Video Instance Segmentation (VIS) is a task that combines
object detection, instance segmentation, and object track-
ing across video frames, which aims to identify and seg-
4.1.1 Video Object Segmentation
ment each object instance over the whole video sequence.
Video Object Segmentation (VOS) refers to the process This is a much less explored task compared to VOS but has
of segmenting a specific object across an entire video se- been gaining interest. We evaluate our method on the dense-
quence. Semi-supervised VOS (also known as one-shot video task of the UVO v1.0 [34] dataset.
VOS or semi-automatic VOS) is the primary setting for
VOS on which we evaluate our method. In this setting, the UVO v1.0. The Unidentified Video Objects (UVO)
ground truth object mask of the first frame is provided, and dataset is designed to recognize and segment all objects re-
the task is to predict the masks for subsequent frames. Al- gardless of the categories, even those unseen during train-
ternatively, the first frame label can be a bounding box in- ing, thereby focusing on VIS in the open world. Each video

6
in UVO features on average 12.3 object annotations, a con- SAM-PT-reinit on the MOSE dataset, and likewise hin-
siderable increase from previous datasets having only 2 or dered SAM-PT-reinit on the YouTube-VOS dataset. Conse-
3 objects per video on average. UVO sources its videos quently, iterative refinement was deactivated for these spe-
from the Kinetics-400 [14] dataset and contains three dif- cific datasets. For DAVIS, we additionally report results for
ferent splits: FrameSet, VideoSparseSet, and VideoDens- replacing SAM with HQ-SAM [15] and denote the model
eSet. The VideoDenseSet consists of 3-second clips an- variants as HQ-SAM-PT and HQ-SAM-PT-reinit. The HQ-
notated densely at 30fps and tracked over time. The pri- SAM variants use 3 iterative refinement iterations instead of
mary goal of VideoDenseSet is to study video open-world 12 iterations.
segmentation. Objects identifiable under COCO categories
carry their respective COCO labels, while ambiguous ob-
VOS Evaluation. When evaluating on VOS, we use the
jects or those outside the COCO taxonomy are labeled as
provided ground truth mask for the first frame to sample the
“other”. This meticulous and exhaustive annotation struc-
query points required by our method. Then, we give only
ture makes the VideoDenseSet ideal for research areas that
the sampled points as input to our method, not the mask.
require an understanding of videos in a dense and compre-
For all datasets, we use the full-resolution data and resize it
hensive manner, such as robotics, autonomous driving, and
to the longest side of 1024 to match SAM’s input resolution.
augmented-reality applications.

Metrics. We evaluate our method using standard evalu- VIS Evaluation. For evaluating our method on the VIS
ation metrics in image instance segmentation, adapted for task, we leverage SAM’s automatic mask generation ca-
video instance segmentation [38]. These include Average pacity to generate up to 100 mask proposals for the initial
Precision (AP) and Average Recall (AR) IoU-based met- frame. We then propagate these proposed masks through-
rics. Given that each instance in a video comprises a se- out the entire video sequence using our method. We eval-
quence of masks, unlike image instance segmentation, IoU uate TAM [40], a concurrent method we compare against,
computation is carried out not only in the spatial dimensions in the same manner. Our mask proposal generation process
but also in the temporal dimension. This implies that the is currently simplistic and does not create any proposals for
sum of intersections at every single frame is divided by the subsequent video frames. Consequently, it cannot identify
sum of unions at every single frame. These metrics are gen- objects that emerge in later frames, placing it at a disadvan-
erally computed on a per-category basis and subsequently tage compared to VIS methods that are capable of doing so.
averaged across all categories. However, we work with the Despite this limitation, our approach provides a consistent
class-agnostic version of UVO. platform for comparing zero-shot methods in terms of how
effectively they propagate diverse mask proposals from the
4.2. Implementation Details first frame.
Training Data. For our experiments, we use pre-trained 4.3. Ablation Study
checkpoints provided by the respective authors for both
PIPS [11] and SAM. PIPS is trained exclusively on a syn- We conducted detailed ablation experiments on the
thetic dataset, FlyingThings++ [11], derived from the Fly- DAVIS 2017 validation subset to validate various compo-
ingThings [24] optical flow dataset. This dataset includes nents and designs of SAM-PT. We employed SAM’s ViT-H
multi-frame amodal trajectories with synthetic occlusions as the backbone, for all tests. Each aspect was examined
caused by moving objects. SAM, on the other hand, has sequentially, integrating the optimal settings obtained from
been trained on the large-scale SA-1B dataset, the largest prior experiments. To ensure statistical soundness, multiple
image segmentation dataset to date, with over 1 billion iterations of each experiment were carried out (between 4
masks on 11M licensed and privacy-respecting images. It and 12 runs per setup), with findings represented as mean
is noteworthy that neither of these datasets includes video and standard deviation across these runs.
segmentation data, and they do not overlap with any of our While these results provide insight, there may be a risk
evaluation data. This effectively positions our model in a of overfitting due to our limited validation dataset. While
zero-shot video segmentation setting. we endeavored to maintain a consistent evaluation protocol,
future research should aim for a larger validation set, pos-
sibly derived from the YouTube VOS 2018 train dataset, to
Model Variations. Our experiments led to two optimal
mitigate this concern.
model hyperparameters, distinguished as SAM-PT (with-
out reinitialization) and SAM-PT-reinit (with reinitializa-
tion). These configurations were derived from our abla- Query Point Sampling Fig. 5 illustrates that the number
tion study in Sec. 4.3. However, we found that using it- of positive points and the choice of point selection meth-
erative refinement negatively impacted both SAM-PT and ods significantly influence performance. Using 8 points

7
80 Table 2. Point Tracker and SAM Configuration Ablation
32.2 ± 0.7 20.3 ± 0.1 37.1 ± 21.7 29.9 ± 0.9
positive points
1

Study Results. Using the best parameters from the ablation study
72.3 ± 1.2 72.0 ± 0.3 70.5 ± 1.4 70.6 ± 0.8
8

in Fig. 5, we report the mean performance (with standard devia-


70
71.4 ± 0.2 66.6 ± 0.4 70.0 ± 1.1 70.0 ± 0.7 tion) on the validation subset of DAVIS 2017 to study the impact
72 16

58.0 ± 0.2 54.4 ± 0.3 62.6 ± 0.4 62.8 ± 0.5 of (a) different point trackers, (b) the number of negative points
60 per mask, (c) the use of iterative refinement and (d) patch similar-
kmedoids shi-tomasi random mixed ity filtering. We find that the best configuration uses PIPS as the
point selection method
point tracker, 1 negative point per mask, 12 refinement iterations,
Figure 5. Query Point Sampling Ablation Study. We report and no patch similarity filtering. PT: point tracker. NP: negative
mean J &F scores along with the standard deviation on the val- points per mask. IRI: iterative refinement iterations. PS: point
idation subset of DAVIS 2017 for different number of positive similarity filtering threshold.
points per mask and different point selection methods. Using 8
SAM-PT Configuration DAVIS 2017 Validation [27]
positive points per mask leads to a 40-point performance boost
PT NP IRI PS J &F J F Gain
compared to using a single point. Given 8 positive points, K-
(a) point tracker
Medoids and Shi-Tomasi perform comparably well as point se- RAFT [30] 0 0 ✗ 63.0 ± 0.6 60.7 ± 0.6 65.4 ± 0.5
lection methods. SuperGlue [28] 0 0 ✗ 21.7 ± 2.8 19.6 ± 2.1 23.8 ± 3.4
SuperGlue [28] 0 3 ✗ 28.4 ± 3.1 24.7 ± 2.4 32.0 ± 3.8
TapNet [9] 0 0 ✗ 60.9 ± 0.2 58.2 ± 0.3 63.5 ± 0.2
PIPS [11] 0 0 ✗ 72.3 ± 1.2 70.4 ± 1.3 74.3 ± 1.1 +9.3
per mask showed a remarkable 40-point performance en- (b) negative points per mask

hancement compared to a single point. This substantiates PIPS 0 0 ✗ 72.3 ± 1.2 70.4 ± 1.3 74.3 ± 1.1
PIPS 1 0 ✗ 74.1 ± 0.7 72.1 ± 0.6 76.1 ± 0.7 +1.8
the argument that a single positive point is inadequate for PIPS 8 0 ✗ 74.0 ± 0.8 71.9 ± 0.8 76.0 ± 0.9
PIPS 16 0 ✗ 73.4 ± 0.6 71.4 ± 0.6 75.3 ± 0.6
prompting SAM as it often results in the segmentation of PIPS 72 0 ✗ 72.2 ± 0.4 70.3 ± 0.4 74.0 ± 0.4
partial objects only. Among the point selection methods, (c) iterative refinement iterations
K-Medoids and Shi-Tomasi produced comparable results, PIPS 1 0 ✗ 74.1 ± 0.7 72.1 ± 0.6 76.1 ± 0.7
PIPS 1 1 ✗ 75.7 ± 0.7 73.4 ± 0.7 78.1 ± 0.6
with a slight preference towards K-Medoids owing to its PIPS 1 3 ✗ 76.0 ± 0.6 73.4 ± 0.7 78.6 ± 0.7
marginally higher mean score and resilience to the number PIPS 1 12 ✗ 76.3 ± 0.6 73.6 ± 0.6 78.9 ± 0.6 +2.2

of positive points per mask. (d) patch similarity filtering


PIPS 1 12 ✗ 76.3 ± 0.6 73.6 ± 0.6 78.9 ± 0.6 none
PIPS 1 12 0.002 72.7 ± 2.0 70.2 ± 1.8 75.2 ± 2.1
PIPS 1 12 0.01 70.7 ± 2.0 68.3 ± 1.8 73.2 ± 2.1
Point Tracking. Tab. 2a shows that PIPS [11] demon-
strated superior performance over TapNet [9], Super-
Glue [28], and RAFT [30]. TapNet’s limitations stem method for sampling negative points which amounts to us-
from its lack of effective time consistency and its train- ing random sampling when there is only one negative point
ing on 256x256 images, which hampered its performance per mask.
with higher-resolution images. SuperGlue, while proficient
in matching sparse features across rigid scenes, grapples
with effectively matching points from the reference frame Iterative Refinement. The iterative refinement approach
in dynamic scenes, particularly under object deformations. contributed to higher-quality masks and mitigated the im-
RAFT, being an optical flow model, faced difficulties han- pact of artifacts in SAM’s output. Tab. 2c displays that
dling occlusions. Although PIPS’s prior use in our ex- this yielded an improvement of 2.2 points over the non-
periments may have offered some hyperparameter advan- refinement approach.
tages, its superior performance is primarily attributable to
its more robust design that emphasizes trajectory model- Patch Similarity. Our initial findings in Tab. 2d suggest
ing over eight subsequent frames. This approach fosters that using patch similarity to filter unreliable tracking points
the generation of coherent point trajectories and enhances was overly restrictive in our context, leading to substan-
occlusion detection. tial deletion of points. Although it did not prove benefi-
cial in our current setup, this aspect certainly warrants fur-
Negative Points. Tab. 2b highlights that incorporating ther exploration, particularly in scenarios involving point
negative points had a favorable impact, particularly in re- re-initialization.
ducing segmentation errors when points deviated from the
target object. The addition of negative points empowered Reinitialization. Fig. 6 presents the performance of dif-
SAM to better handle the point trackers’ failure cases, lead- ferent reinitialization variants. In Tab. 5 and Tab. 6, we also
ing to improved segmentation and a 1.8-point enhancement show it brings 2.5 and 2.0 points improvements on MOSE
over the non-use of negative points. Note that throughout and UVO benchmarks respectively. The re-initialization
all experiments, we always used the mixed point sampling process enhanced robustness against points falling off ob-

8
77 Table 3. Quantitative results in semi-supervised VOS on the vali-
reinitialization variant

none 76.3 ± 0.6


dation subset of DAVIS 2017.
A 75.7 ± 0.7 74.9 ± 0.9 76.6 ± 0.8 76.6 ± 0.7 76
B 75.8 ± 0.6 76.0 ± 1.1 76.1 ± 0.4 74.8 ± 0.8 DAVIS 2017 Validation [27]
Method J &F J F
C 75.5 ± 0.7 75.1 ± 0.6 75.4 ± 0.6 75.0 ± 0.4 75
(a) trained on video segmentation data
D 75.4 ± 0.2 75.6 ± 1.5 76.4 ± 0.3 75.2 ± 1.1 AGSS [20] 67.4 64.9 69.9
74 AGAME [13] 70.0 67.2 72.7
0–1 0 – 72 12 – 1 12 – 72 AFB-URR [18] 74.6 73.0 76.1
refinement iterations – negative points STM [25] 81.8 79.2 84.3
SWEM [21] 84.3 81.2 87.4
Figure 6. Reinitialization Ablation Study. Mean J &F scores RDE [17] 86.1 82.1 90.0
(with std. dev.) on the validation subset of DAVIS 2017 for dif- SwinB-DeAOT-L [41] 86.2 83.1 89.2
ferent reinitialization variants and configurations. The best result XMem [4] 87.7 84.0 91.4
is achieved by reinitialization variant A, 12 refinement iterations, (b) not trained on video segmentation data (zero-shot)
and 72 negative points per mask. The reinitialization variants dif- Painter [35] 34.6 28.5 40.8
fer in how the timestep at which the points get reinitialized is cho- DINO [2] 71.4 67.9 74.9
SegGPT [36] 75.6 72.5 78.6
sen, see Appendix A for more details. Although using reinitializa-
tion improves the performance only marginally on the validation PerSAM-F [45] 71.9 69.0 74.8
SAM-PT (ours) 76.3 ± 0.6 73.6 ± 0.6 78.9 ± 0.6
subset, this strategy demonstrates substantial improvement on the SAM-PT-reinit (ours) 76.6 ± 0.7 74.4 ± 0.8 78.9 ± 0.6
MOSE 2023 (Tab. 5) and UVO (Tab. 6) datasets.
HQ-SAM-PT [15] (ours) 77.2 ± 0.5 74.7 ± 0.5 79.8 ± 0.4
HQ-SAM-PT-reinit [15] (ours) 77.0 ± 0.7 74.8 ± 0.8 79.2 ± 0.6

jects. By reinitializing all points based on the current mask


prediction, we account for errors in point tracker outputs by We also outperform PerSAM-F on the YouTube-VOS
discarding incorrect points and starting fresh from the cur- 2018 and MOSE 2023 datasets, achieving mean scores of
rent mask prediction. However, this assumes that we trust 67.0 and 41.0 as shown in Tabs. 4 and 5. However, with dif-
the currently outputted mask, which may not always be the ferent mask training data, our performance falls short when
case and sometimes leads to failures. compared to SegGPT on the two datasets.

In summary, our best-performing SAM-PT model em- Qualitative Analysis. Visualizations of successful video
ploys K-Medoids for point selection with 8 points per mask, segmentation on DAVIS 2017 for SAM-PT and SAM-PT-
PIPS for point tracking, a single negative point per mask, reinit can be seen in Fig. 7a and Fig. 7b respectively. No-
and employs 12 iterations for iterative refinement without tably, Fig. 8 presents successful video segmentation on un-
patch similarity filtering. Meanwhile, using reinitialization seen web videos – clips from the “Avatar: The Last Airben-
achieved optimum performance with 12 refinement itera- der” anime-influenced animated television series, demon-
tions and 72 negative points per mask. strating the zero-shot capabilities of our method.

4.4. Comparison with State-of-the-art Methods Limitations and Challenges. Despite the competitive
zero-shot performance, certain limitations persist, primar-
All reported results were computed with official tools or
ily due to the limitations of our point tracker in handling
official evaluation servers. Sec. 4.4.1 reports Video Ob-
occlusion, small objects, motion blur, and re-identification.
ject Segmentation results, including qualitative results on
In such scenarios, the point tracker’s errors propagate into
unseen web videos. Sec. 4.4.2 reports Video Instance Seg-
future video frames. Fig. 7c illustrates these problematic
mentation results.
instances on DAVIS 2017, while Fig. 9 presents additional
cases on “Avatar: The Last Airbender” clips. Although us-
4.4.1 Video Object Segmentation ing point re-initialization and negative points somewhat al-
leviates the failures of the point tracker, they still prevent
Performance Overview. Our proposed method outper- the performance from being on par with methods trained on
forms others that have not been trained on any video object video data.
segmentation data on the DAVIS 2017 dataset, as reflected
in Tab. 3. A mean J &F score of 76.6 points exceeds the
4.4.2 Video Instance Segmentation
PerSAM-F by 4.7 points and the SegGPT generalist model
by a single point. The experiments were repeated 8 times Results and Analysis. Given the same mask propos-
for statistical robustness, and we report the mean and stan- als, SAM-PT outperforms TAM [40] significantly even
dard deviation of our method’s performance. though SAM-PT was not trained on any video segmentation

9
(a) Good cases for SAM-PT

(b) Good cases for SAM-PT-reinit

(c) Failure cases for SAM-PT

Figure 7. Visualization of SAM-PT on the DAVIS 2017 Validation [27] dataset. Given the first-frame masks, we sample 8 positive
points and either 1 (SAM-PT) or 72 (SAM-PT-reinit) negative points per object to initialize SAM-PT and SAM-PT-reinit. Circles denote
positive points, crosses denote negative points, and red symbols (circle and cross) denote that the point was predicted to be occluded.

10
Figure 8. Successful segmentation using SAM-PT on short clips from “Avatar: The Last Airbender”. Although our method has never
seen data from Avatar, an anime-influenced animated television series, it segments and tracks various objects in short clips.

Figure 9. Challenging scenarios for SAM-PT on short clips from “Avatar: The Last Airbender”. These cases illustrate instances
where our model struggles when faced with point tracking failures that are the result of incorrectly predicting the point at a similar-looking
segment or when faced with object occlusions and disappearing objects.

data. TAM is a concurrent approach combining SAM and


XMem [4], where XMem was pre-trained on BL30K [5]
and trained on DAVIS and YouTube-VOS, but not on UVO.
On the other hand, SAM-PT combines SAM with the PIPS
point tracking method, both of which have not been trained
on video segmentation tasks.

11
Table 4. Quantitative results in semi-supervised VOS on the val- Table 6. Results on the validation split of UVO [34] Video-
idation subset of YouTube-VOS 2018. Metrics are reported sep- DenseSet v1.0. SAM-PT outperforms TAM [40] even though the
arately for “seen” and “unseen” classes, with G being the overall former was not trained on any video segmentation data. TAM is a
average score over the metrics. Note that SegGPT and SAM-PT concurrent approach combining SAM [16] and XMem [4], where
adopt completely different training data. ♠: our reproduced result XMem was pre-trained on BL30K [5] and trained on DAVIS [27]
using the official code of [45]. and YouTube-VOS [38], but not on UVO. On the other hand,
SAM-PT combines SAM with the PIPS point tracking method,
YouTube-VOS 2018 Validation [38] both of which have not been trained on any video segmentation
Method G Js Fs Ju Fu tasks.
(a) trained on video segmentation data
AGAME [13] 66.0 66.9 - 61.2 - Method Propagation AR100 ARs ARm ARl AP
AGSS [20] 71.3 71.3 65.5 75.2 73.1
(a) trained on video segmentation data, including UVO’s training subset
STM [25] 79.4 79.7 84.2 72.8 80.9
Mask2Former VIS [44] N/A 35.4 − − − 27.3
AFB-URR [18] 79.6 78.8 83.1 74.1 82.6 ROVIS [44] N/A 41.2 − − − 32.7
RDE [17] 83.3 81.9 86.3 78.0 86.9
(b) trained on video segmentation data
SWEM [21] 82.8 82.4 86.9 77.1 85.0
XMem [4] 86.1 85.1 89.8 80.3 89.2 TAM [40] Feature Matching 24.1 21.1 32.9 31.1 1.7
SwinB-DeAOT-L [41] 86.2 85.6 90.6 80.0 88.4 (c) not trained on video segmentation data (zero-shot)

(b) not trained on video segmentation data (zero-shot) SAM-PT (ours) Points Prompting 28.8 ± 0.1 23.3 ± 0.1 40.8 ± 0.2 48.3 ± 0.6 6.7 ± 0.2
SAM-PT-reinit (ours) Points Prompting 30.8 25.1 44.1 49.2 6.5
Painter [35] 24.1 27.6 35.8 14.3 18.7
SegGPT [36] 74.7 75.1 80.2 67.4 75.9
PerSAM-F ♠ [45] 54.4 53.9 56.4 50.7 56.6
SAM-PT (ours) 67.0 ± 0.3 68.6 ± 0.2 71.2 ± 0.1 61.0 ± 0.5 67.4 ± 0.4 ing to video labeling. Furthermore, the future incorpora-
SAM-PT-reinit (ours) 67.5 ± 0.2 69.0 ± 0.4 69.9 ± 0.3 63.2 ± 0.4 67.8 ± 0.5
tion of more advanced point trackers can enhance the per-
formance of SAM-PT.
Table 5. Quantitative results in semi-supervised VOS on the vali-
dation subset of MOSE 2023 [8]. ♠: our reproduced result using
References
the official code of [45].
[1] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc
MOSE 2023 Validation [8] Van Gool. Speeded-up robust features (surf). Computer vi-
Method J &F J F sion and image understanding, 110(3):346–359, 2008. 2
(a) trained on video segmentation data
[2] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou,
Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg-
RDE [17] 48.8 44.6 52.9 ing properties in self-supervised vision transformers. In
SWEM [21] 50.9 46.8 54.9 ICCV, 2021. 2, 3, 5, 6, 9
XMem [4] 57.6 53.3 62.0
DeAOT [41] 59.4 55.1 63.8 [3] Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexan-
der Kirillov, Rohit Girdhar, and Alexander G. Schwing.
(b) not trained on video segmentation data (zero-shot) Mask2former for video instance segmentation. arXiv
Painter [35] 14.5 10.4 18.5 preprint arXiv: 2112.10764, 2021. 1, 2
SegGPT [36] 45.1 42.2 48.0 [4] Ho Kei Cheng and Alexander G Schwing. Xmem: Long-
PerSAM-F ♠ [45] 23.3 19.8 26.8 term video object segmentation with an atkinson-shiffrin
SAM-PT (ours) 38.5 ± 0.2 34.9 ± 0.3 42.1 ± 0.2 memory model. In ECCV, 2022. 1, 2, 3, 5, 6, 9, 11, 12
SAM-PT-reinit (ours) 41.0 ± 0.5 38.5 ± 0.5 43.5 ± 0.5 [5] Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Modular
interactive video object segmentation: Interaction-to-mask,
propagation and difference-aware fusion. In CVPR, 2021.
11, 12
5. Conclusion
[6] Yangming Cheng, Liulei Li, Yuanyou Xu, Xiaodi Li,
We present SAM-PT, an innovative solution that ex- Zongxin Yang, Wenguan Wang, and Yi Yang. Segment and
tends SAM’s segmentation ability from static images to dy- track anything. arXiv preprint arXiv:2305.06558, 2023. 2,
3, 5, 6
namic videos. Integrated with long-term point trackers, our
[7] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi-
approach demonstrates strong performance across several
novich. Superpoint: Self-supervised interest point detection
benchmarks including DAVIS, YouTube-VOS, MOSE, and
and description. In CVPRW, 2018. 2
UVO. While our method has limitations such as difficulty
[8] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip
handling occlusions, small objects, and motion blur, and in- H. S. Torr, and Song Bai. Mose: A new dataset for video ob-
consistencies in mask predictions, it contributes a simple ject segmentation in complex scenes. arXiv preprint arXiv:
and effective new point-based perspective to video object 2302.01872, 2023. 6, 12
segmentation research. By illustrating a promising way to [9] Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Re-
extend foundational models like SAM into the video do- casens, Lucas Smaira, Yusuf Aytar, Joao Carreira, Andrew
main, our research provides a potential pathway for ad- Zisserman, and Yi Yang. Tap-vid: A benchmark for tracking
vancements in diverse applications from autonomous driv- any point in a video. In NeurIPS, 2022. 2, 3, 8

12
[10] Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush [26] Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algo-
Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. rithm for k-medoids clustering. Expert Systems with Appli-
Tapir: Tracking any point with per-frame initialization and cations, 36(2, Part 2):3336–3341, 2009. 4, 5
temporal refinement. arXiv preprint arXiv: 2306.08637, [27] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Ar-
2023. 2 beláez, Alexander Sorkine-Hornung, and Luc Van Gool.
[11] Adam W Harley, Zhaoyuan Fang, and Katerina Fragkiadaki. The 2017 davis challenge on video object segmentation.
Particle video revisited: Tracking through occlusions using arXiv:1704.00675, 2017. 6, 8, 9, 10, 12, 14
point trajectories. In ECCV, 2022. 2, 3, 4, 7, 8 [28] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz,
[12] Allan Jabri, Andrew Owens, and Alexei Efros. Space-time and Andrew Rabinovich. Superglue: Learning feature
correspondence as a contrastive random walk. In NeurIPS, matching with graph neural networks. In CVPR, 2020. 2,
2020. 2, 3, 5, 6 8
[13] Joakim Johnander, Martin Danelljan, Emil Brissman, Fa- [29] Jianbo Shi and Tomasi. Good features to track. In CVPR,
had Shahbaz Khan, and Michael Felsberg. A generative ap- 1994. 2, 4, 5
pearance model for end-to-end video object segmentation. In [30] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field
CVPR, 2019. 9, 12 transforms for optical flow. In ECCV, 2020. 2, 8
[14] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, [31] Carlo Tomasi and Takeo Kanade. Detection and tracking of
Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, point. IJCV, 9:137–154, 1991. 2
Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, [32] Qianqian Wang, Yen-Yu Chang, Ruojin Cai, Zhengqi
and Andrew Zisserman. The kinetics human action video Li, Bharath Hariharan, Aleksander Holynski, and Noah
dataset. arXiv preprint arXiv: 1705.06950, 2017. 7 Snavely. Tracking everything everywhere all at once.
[15] Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing arXiv:2306.05422, 2023. 2
Tai, Chi-Keung Tang, and Fisher Yu. Segment anything in [33] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and
high quality. arXiv preprint arXiv: 2306.01567, 2023. 3, 7, Philip HS Torr. Fast online object tracking and segmentation:
9 A unifying approach. In CVPR, 2019. 6
[16] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, [34] Weiyao Wang, Matt Feiszli, Heng Wang, and Du Tran.
Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- Unidentified video objects: A benchmark for dense, open-
head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- world segmentation. In ICCV, 2021. 2, 6, 12
thing. arXiv preprint arXiv:2304.02643, 2023. 1, 2, 3, 12 [35] Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and
[17] Mingxing Li, Li Hu, Zhiwei Xiong, Bang Zhang, Pan Pan, Tiejun Huang. Images speak in images: A generalist painter
and Dong Liu. Recurrent dynamic embedding for video ob- for in-context visual learning. In CVPR, 2023. 2, 3, 5, 6, 9,
ject segmentation. In CVPR, 2022. 9, 12 12
[36] Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang,
[18] Yongqing Liang, Xin Li, Navid Jafari, and Jim Chen.
Chunhua Shen, and Tiejun Huang. Seggpt: Segmenting ev-
Video object segmentation with adaptive feature bank and
erything in context. arXiv preprint arXiv:2304.03284, 2023.
uncertain-region refinement. NeurIPS, 2020. 9, 12
2, 3, 5, 6, 9, 12, 14
[19] Fanchao Lin, Hongtao Xie, Yan Li, and Yongdong Zhang.
[37] Junfeng Wu, Yi Jiang, Wenqing Zhang, Xiang Bai, and Song
Query-memory re-aggregation for weakly-supervised video
Bai. Seqformer: a frustratingly simple model for video in-
object segmentation. In AAAI, 2021. 6
stance segmentation. In ECCV, 2022. 1
[20] Huaijia Lin, Xiaojuan Qi, and Jiaya Jia. Agss-vos: Atten- [38] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen
tion guided single-shot video object segmentation. In ICCV, Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A
2019. 9, 12 large-scale video object segmentation benchmark, 2018. 6,
[21] Zhihui Lin, Tianyu Yang, Maomao Li, Ziyu Wang, Chun 7, 12
Yuan, Wenhao Jiang, and Wei Liu. Swem: Towards real- [39] Charig Yang, Hala Lamdouar, Erika Lu, Andrew Zisserman,
time video object segmentation with sequential weighted and Weidi Xie. Self-supervised video object segmentation
expectation-maximization. In CVPR, 2022. 9, 12 by motion grouping. In ICCV, 2021. 5
[22] David G Lowe. Distinctive image features from scale- [40] Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing
invariant keypoints. IJCV, 60:91–110, 2004. 2 Wang, and Feng Zheng. Track anything: Segment anything
[23] Bruce D Lucas and Takeo Kanade. An iterative image reg- meets videos. arXiv preprint arXiv:2304.11968, 2023. 2, 3,
istration technique with an application to stereo vision. In 5, 6, 7, 9, 12
IJCAI, 1981. 2 [41] Zongxin Yang and Yi Yang. Decoupling features in hi-
[24] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, erarchical propagation for video object segmentation. In
Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A NeurIPS, 2022. 1, 2, 5, 6, 9, 12
large dataset to train convolutional networks for disparity, [42] Zongxin Yang and Yi Yang. Decoupling features in hi-
optical flow, and scene flow estimation. In CVPR, 2016. 7 erarchical propagation for video object segmentation. In
[25] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo NeurIPS, 2022. 3
Kim. Video object segmentation using space-time memory [43] K. M. Yi, Eduard Trulls, Vincent Lepetit, and P. Fua. Lift:
networks. In ICCV, 2019. 9, 12 Learned invariant feature transform. ECCV, 2016. 2

13
[44] Zitong Zhan, Daniel McKee, and Svetlana Lazebnik. Robust online video instance segmentation with track queries. arXiv preprint arXiv:2211.09108, 2022. 12
[45] Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Hao Dong, Peng Gao, and Hongsheng Li. Personalize segment anything model with one shot. arXiv preprint arXiv:2305.03048, 2023. 2, 3, 5, 6, 9, 12, 14

Table 7. Quantitative results in semi-supervised video object segmentation on additional subsets of DAVIS. ♠: our reproduced result using the official code of [45].

Method             DAVIS 2016 Validation [27]           DAVIS 2017 Test-dev [27]
                   J&F         J           F            J&F         J           F
PerSAM-F ♠ [45]    74.8        74.5        75.0         47.6        45.5        49.7
SegGPT [36]        82.3        81.8        82.8         -           -           -
SAM-PT             83.1 ± 1.5  83.0 ± 0.8  83.0 ± 1.1   62.7 ± 0.5  59.4 ± 0.6  66.1 ± 0.4
SAM-PT-reinit      80.2 ± 0.6  80.3 ± 0.6  80.0 ± 0.6   61.5 ± 1.1  59.3 ± 1.0  63.8 ± 1.2
A. Point Tracking Reinitialization
In our SAM-PT-reinit method, we introduce a reinitialization strategy: the point tracker starts anew every h frames, where h is either a pre-set tracking horizon (e.g., 8 frames) or is determined dynamically from SAM's mask predictions at each timestep within the horizon (e.g., using a most-similar-mask-area heuristic). Upon reaching the horizon, the query points given to the tracker are reinitialized according to the mask that SAM predicted at the horizon frame. While this may increase the computational load (especially if some of SAM's computed masks are discarded), it yields substantial performance improvements on demanding video sequences, such as those in the MOSE dataset. A minimal sketch of the resulting loop is given after the list of variants below.

We explored four reinitialization variants, each differing in how it determines h:

(A) Reinit-on-Horizon-and-Sync-Masks: This straightforward variant reinitializes points after a fixed number of frames (e.g., every 8 frames). It may stumble if the mask is absent at the reinitialization timestep, but it runs at the same speed as methods that do not employ reinitialization.

(B) Reinit-at-Median-of-Area-Diff: The tracker outputs trajectory points for each frame within the horizon, and SAM predicts masks based on these trajectories. Reinitialization happens at the frame within the horizon that has the mean mask area among the non-empty masks predicted by SAM. Notably, this approach may be significantly slower than methods without reinitialization, as it may reject several SAM masks (e.g., out of 8 computed masks, reinitialization might occur at the second one, necessitating recomputation of the remaining 6 masks in the next step).

(C) Reinit-on-Similar-Mask-Area: This method triggers reinitialization when the mask area is similar to the initial mask area, which makes it several times slower than methods without reinitialization.

(D) Reinit-on-Similar-Mask-Area-and-Sync-Masks: This variant reinitializes when the mask area of every mask in the batch is similar to its initial mask area, synchronizing the masks to be tracked from the same timestep. The synchronization allows negative points from other masks to be used when querying SAM, but it also runs several times slower than methods without reinitialization.
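To make the loop concrete, the Python sketch below shows one way the fixed-horizon variant (A) could be wired together. It is only an illustration under stated assumptions, not the released SAM-PT implementation: the callables track_points, predict_mask, and sample_points are hypothetical stand-ins for the point tracker, SAM prompting, and query-point sampling, and masks_area_similar with its tolerance value only hints at the area-based triggers of variants (C) and (D).

import numpy as np

def masks_area_similar(mask, init_mask, tol=0.5):
    # Hypothetical area-similarity trigger in the spirit of variants (C)/(D):
    # fire when the current mask area is within a relative tolerance of the
    # initial mask area. The tolerance is an assumption made for illustration.
    init_area = max(int(init_mask.sum()), 1)
    return abs(int(mask.sum()) - init_area) / init_area <= tol

def sam_pt_reinit(frames, init_points, track_points, predict_mask, sample_points, horizon=8):
    # Fixed-horizon reinitialization loop in the spirit of variant (A).
    # Assumed interfaces of the stand-in callables:
    #   track_points(frames, points) -> (trajectories, occluded)  # long-term point tracker
    #   predict_mask(frame, points)  -> boolean mask (H, W)       # SAM prompted with points
    #   sample_points(mask)          -> fresh query points        # e.g. K-Medoids sampling
    masks = []
    points = init_points  # positive/negative query points from the first-frame annotation
    start = 0
    while start < len(frames):
        chunk = frames[start:start + horizon]
        # Propagate the current query points through the next `horizon` frames.
        trajectories, occluded = track_points(chunk, points)
        for t, frame in enumerate(chunk):
            visible = trajectories[t][np.logical_not(occluded[t])]  # drop occluded points
            masks.append(predict_mask(frame, visible))
        # Reinitialize from the mask predicted at the horizon frame, unless it is
        # empty (the failure mode noted for variant (A)).
        if masks[-1].any():
            points = sample_points(masks[-1])
        start += horizon
    return masks

Variants (B)-(D) differ only in how the horizon is cut short; for example, plugging masks_area_similar into the inner loop would approximate the trigger used by variant (C).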
From our investigations, we found the (A) Reinit-on-Horizon-and-Sync-Masks strategy to be the most effective, as indicated by its superior performance on the DAVIS 2017 validation subset. The best choice of reinitialization method may, however, depend on the specific validation subset and on the degree of hyperparameter tuning involved. Note that we always use reinitialization together with negative points.

A.1. Computational Cost and Speed Optimization

The introduction of reinitialization in SAM-PT-reinit comes with a trade-off: it slows down inference by a factor of 2 to 8, depending on the reinitialization method and parameters used. The major bottleneck is the invocation of SAM's backbone for every video frame. We propose caching the backbone outputs for unprocessed video frames as a possible way to mitigate this slowdown. This strategy requires storing embeddings for all video frames in working memory, but offers a potentially significant speedup that is particularly useful for applications requiring faster inference; a sketch of such a cache is given below.
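As a rough illustration of this speed optimization, the sketch below memoizes per-frame backbone embeddings so that repeated mask-decoding passes over the same frame reuse a single backbone forward pass. The encode_image callable is an assumed stand-in for one forward pass of SAM's image encoder, not an actual API of the released code.

class FrameEmbeddingCache:
    # Minimal sketch of an embedding cache; encode_image is a hypothetical
    # callable wrapping one forward pass of SAM's image backbone.

    def __init__(self, encode_image):
        self._encode_image = encode_image
        self._embeddings = {}  # frame index -> cached backbone embedding

    def get(self, frame_index, frame):
        # Encode each frame at most once; later mask-decoding passes reuse the result.
        if frame_index not in self._embeddings:
            self._embeddings[frame_index] = self._encode_image(frame)
        return self._embeddings[frame_index]

    def clear(self):
        # Release the stored embeddings, e.g. between videos.
        self._embeddings.clear()

With such a cache, a rejected horizon in variant (B) repeats only the lightweight prompt encoding and mask decoding rather than the expensive backbone pass, at the cost of holding one embedding per frame in working memory.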
B. More DAVIS Subsets

We report results on the DAVIS 2016 Validation and DAVIS 2017 Test-dev subsets in Tab. 7.

C. Per-sequence DAVIS 2017 Validation Results

See Fig. 10 for per-sequence DAVIS 2017 Validation results. For the exact numbers and tables, check our GitHub experiment summaries or the Wandb project.
[Figure 10 charts: six panels of per-sequence J-Mean on the DAVIS 2017 Validation sequences (bike-packing through soapbox), one panel per ablated factor: (a) number of positive points per mask (1, 8, 16, 72); (b) point selection method (mixed, random, kmedoids, shi-tomasi); (c) point tracking module (PIPS, RAFT, TapNet); (d) number of negative points per mask (0, 1, 8, 16, 72); (e) number of iterative refinement iterations (0, 1, 3, 12); (f) patch similarity filtering threshold (none, thr=0.01, thr=0.002).]
Figure 10. Per-sequence Ablation Results. The charts plot the median J-Mean result of different ablation experiments along with 95% bootstrapped confidence intervals, based on 1000 bootstrap samples.