KeyTrack
Abstract
point on a pose. Similar to how BERT tokenizes words [14], we propose an embedding scheme for pose data that captures spatio-temporal relationships and feed these embeddings to our transformer network. Since these embeddings contain information beyond spatial location, our network outperforms convolution-based approaches in terms of accuracy and speed, particularly at very low resolutions.

Additionally, in order to improve the keypoint estimates used by the transformer network, we propose a Temporal Object Keypoint Similarity (TOKS) method. TOKS refines the pose estimation output by augmenting missed detections and thresholding low-quality estimates using a keypoint similarity metric. TOKS adds no learned parameters to the estimation step, and is superior to existing bounding box propagation methods that often rely on NMS and optical flow. KeyTrack makes the following contributions:

1. KeyTrack introduces Pose Entailment, where a binary classification is made as to whether two poses from different timesteps are the same person. We model this task in a transformer-based network which learns temporal pose relationships even in datasets with complex motion. Furthermore, we present a tokenization scheme for pose information that allows transformers to outperform convolutions at low spatial resolutions when tracking keypoints.

2. KeyTrack introduces a temporal method for improving keypoint estimates. TOKS is more accurate than bounding box propagation, faster than a detector ensemble, and does not require learned parameters.

Using the above methods, we develop an efficient multi-person pose tracking pipeline which sets a new SOTA on the PoseTrack test set. We achieve 61.2% tracking accuracy on the PoseTrack'17 Test Set and 66.6% on the PoseTrack'18 Val set using a model that consists of just 0.43M parameters in the tracking step. This portion of our pipeline is 500X more efficient than the leading optical flow method [45]. Our training is performed on a single NVIDIA 1080Ti GPU. Because it does not rely on RGB or optical flow information in the tracking step, our model is suitable for pose tracking with other, non-visual pose estimation sensors that only provide 15 keypoints for each person [3].

Method | Estimation | Detection Improvement | Tracking
Top-Down
Ours | HRNet | Temporal OKS | Pose Entailment
HRNet [45] | HRNet | BBox Prop. | Optical Flow
POINet [40] | VGG, T-VGG | - | Ovonic Insight Net
MDPN [20] | MDPN | Ensemble | Optical Flow
LightTrack [35] | Simple Baselines | Ensemble/BBox Prop. | GCN
ProTracker [19] | 3D Mask RCNN | - | IoU
Bottom-Up
Affinity Fields [38] | VGG/STFields | - | STFields
STEmbeddings [28] | STEmbeddings | - | STEmbeddings
JointFlow | Siamese CNN | - | Flow Fields

Table 1. How different approaches address each step of the Pose Tracking problem. Our contributions are in bold.

coordinates of the joints [48] or to generate heatmaps of the probability of a joint's location [47, 57, 52]. A majority of the convolutional approaches can be classified into top-down and bottom-up methods – the top-down methods use a separate detection step to identify person candidates [21, 37, 10, 24]. The single-person pose estimation step is then performed on these person candidates. Bottom-up methods calculate keypoints from all candidates and then correlate these keypoints into individual human joints [53, 25]. The latter method is more efficient since all keypoints are calculated in a single step; however, the former is more accurate since the object detection step limits the regression boundaries. Top-down methods work poorly on small objects, though, and recent work (HRNet) [45] uses parallel networks at different resolutions to maximize spatial information. PoseWarper [8] uses a pair of labeled and unlabeled frames to predict human pose by learning the pose-warping using deformable convolutions. Finally, since the earliest applications of deep learning to pose estimation [48], iterative predictions have improved accuracy. Pose estimation has been shown to benefit from cascaded predictions [10], and pose-refinement methods [17, 34] refine the pose estimation results of previous stages using a separate post-processing network. In that spirit, our work, KeyTrack, relies on HRNet to generate keypoints and refines keypoint estimates by temporally aggregating and suppressing low-confidence keypoints with TOKS instead of commonly used bounding box propagation approaches.
Figure 2. a) Keypoints are estimated with HRNet. b) TOKS improves detection accuracy. c) Pose pairs are collected from multiple past timesteps. Poses of the same color have the same track id; the color black indicates the track id is unknown. d) Each pair is tokenized independently from the other pairs. e) Our Transformer Matching Network calculates match scores independently for each pair. f) The maximum match score is greedily chosen and the corresponding track id is assigned.
on spatial resolution, and thus more efficient. We can also model more fine-grained spatio-temporal relationships.

Among non-learned tracking methods, optical flow propagates poses from one frame to the next to determine which pose they are most similar to in the next frame [45, 20]. This improves over spatial consistency, which measures the IoU between bounding boxes of poses from temporally adjacent frames [19]. Other methods use graph-partitioning based approaches to group pose tracks [26, 27, 29]. Another method, PoseFlow [55], uses inter/intra-frame pose distance and NMS to construct pose flows. Our method, in contrast, does not require hard-coded parameters during inference; such parameters limit the ability of non-learned methods to model scenes with complex motion and require time-intensive manual tuning. Table 1 shows top-down methods similar to our work as well as competitive bottom-up methods.

Transformer Models  Recently, there have been successful implementations of transformer-based models for image and video input modalities, often substituting for convolutions and recurrence mechanisms. These methods can efficiently model higher-order relationships between various scene elements, unlike pair-wise methods [11, 22, 41, 56]. They have been applied to image classification [39], visual question-answering [30, 31, 46, 60], action recognition [23, 32], video captioning [44, 61] and other video problems. Video-Action Transformer [18] solves the action localization problem using transformers by learning the context and interactions for every person in the video. BERT [13] uses transformers by pretraining a transformer-based network in a multi-task transfer learning scheme over the unsupervised tasks of predicting missing words or next sentences. Instead, in a supervised setting, KeyTrack uses transformers to learn spatio-temporal keypoint relationships for the visual problem of pose tracking.

3. Method

3.1. Overview of Our Approach

We now describe the keypoint estimation and tracking approach used in KeyTrack, as shown in Figure 2. For frame $F^t$ at timestep $t$, we wish to assign a track id to the $i$th pose $p^{t,i} \in P^t$. First, each of the pose's keypoints $k_j \in K$ is detected. This is done by localizing a bounding box around each pose with an object detector and then estimating keypoint locations in the box. Keypoint predictions are improved with temporal OKS (TOKS); please see 3.3 for more details. From here, this pose with no tracking id, $p^{t,i}_I$, is assigned its appropriate one. This is based on the pose's similarity to a pose in a previous timestep, which has an id, $p^{t-\delta,j}_{id}$. Similarity is measured with the match score, $m^{t-\delta,j}_{id}$, using Pose Entailment (3.2).

False negatives are an inevitable problem in keypoint detection, and hurt the downstream tracking step because poses with the correct track id may appear to be no longer in the video. We mitigate this by calculating match scores for poses in not just one previous frame, but multiple frames $\{F^1, F^2, \ldots, F^\delta\}$. Thus, we compare to each pose $p^{t-d,j}_{id}$ where $1 \le d \le \delta$ and $1 \le j \le |P^{t-d}|$. In practice, we limit the number of poses we compare to in a given frame to the $n$ spatially nearest poses. This is just as accurate as comparing to everyone in the frame and bounds our runtime to $O(\delta n)$. This gives us a set of match scores $M$, and we assign $p^{t,i}_I$ the track id corresponding to the maximum match score $m^* = \max(M)$, where $id^* = m^*_{id}$. Thus, we assign the tracking id to the pose, $p^{t,i}_{id^*}$.
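The greedy assignment just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: `match_score` stands in for the Pose Entailment classifier of 3.2, and each previously tracked pose is assumed to carry a `track_id` and a `center` used to pick the $n$ spatially nearest candidates.

```python
# Minimal sketch of the greedy track-id assignment described above.
# Assumptions: `match_score(prev_pose, cur_pose, d)` is the Pose Entailment
# classifier of Sec. 3.2, and every tracked pose dict carries "track_id" and
# "center" (used to select the n spatially nearest candidates per frame).
import math

def assign_track_id(pose, prev_frames, match_score, n=3, delta=4):
    """Give `pose` (no id yet) the id of its best match over the last `delta` frames."""
    candidates = []
    # d = 1 is the most recent previous frame, d = delta the oldest considered
    for d, frame in enumerate(reversed(prev_frames[-delta:]), start=1):
        nearest = sorted(frame, key=lambda q: math.dist(pose["center"], q["center"]))[:n]
        for q in nearest:
            candidates.append((match_score(q, pose, d), q["track_id"]))
    best_score, best_id = max(candidates)   # m* = max(M); id* is its track id
    pose["track_id"] = best_id
    return pose
```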
Figure 3. Orange box: Visualizations to intuitively explain our tokenization. In the Position column, the matching poses are spatially closer together than the non-matching ones. This is because their spatial locations in the image are similar. The axis limit is 432 because the image has been downsampled to width × height = 432. In the following column, the matching contours are similar, since the poses are in similar orientations. The Segment axis in the last column represents the temporal distance of the pair. Green box: A series of transformers (Tx) compute self-attention, extracting the temporal relationship between the pair. Binary classification follows.
3.2. Pose Entailment

To effectively solve the multi-person pose tracking problem, we need to understand how human poses move through time based on spatial joint configurations, as well as in the presence of multiple persons and occluding objects. Hence, we need to learn if a pose in timestep $t$ can be inferred from timestep $t-1$. Textual entailment provides us with a similar framework in the NLP domain, where one needs to understand if one sentence can be implied from the next. More specifically, the textual entailment model classifies whether a premise sentence implies a hypothesis sentence in a sentence pair [9]. The typical approach to this problem consists of first projecting the pair of sentences to an embedding space and then feeding them through a neural network which outputs a binary classification for the sentence pair. Hence, we propose the Pose Entailment problem. More formally, we seek to classify whether a pose in timestep $t-\delta$, $p^{t-\delta}$, i.e. the premise, and a pose in timestep $t$, $p^t$, i.e. the hypothesis, are the same person. To solve this problem, instead of using visual feature based similarity that incurs large computational cost, we use the set of human keypoints, $K$, detected by our pose estimator. It is computationally efficient to use these as there are a limited number of them (in our case $|K| = 15$), and they are not affected by unexpected visual variations such as lighting changes in the tracking step. In addition, as we show in the next section, keypoints are amenable to tokenization. Thus, during the tracking stage, we use only the keypoints estimated by the detector as our pose representation.

Tokenizing Pose Pairs  The goal of tokenization is to transform pose information into a representation that facilitates learning spatio-temporal human pose relationships. To achieve this goal, for each pose token, we need to provide (i) the spatial location of each keypoint in the scene, to allow the network to spatially correlate keypoints across frames, (ii) type information of each keypoint (i.e. head, shoulder, etc.), to learn spatial joint relationships in each human pose, and finally (iii) the temporal location index for each keypoint within a temporal window $\delta$, to learn temporal keypoint transitions. Hence, we use three different types of tokens for each keypoint, as shown in Figure 3. There are 2 poses, and thus $2|K|$ tokens of each type. Each token is linearly projected to an embedding, $E \in \mathbb{R}^{2|K| \times H}$, where $H$ is the transformer hidden size. Embeddings are a learned lookup table. We now describe the individual tokens in detail:

Position Token: The absolute spatial location of each keypoint is the Position token, $\rho$, and its values fall in the range $[1, w_F h_F]$. In practice, the absolute spatial location in a downsampled version of the original frame is used. This not only improves the efficiency of our method, but also makes it more accurate, as is discussed in 5.2. We give a general expression for the Position tokens of poses $p^t$ and $p^{t-\delta}$, where $\rho_j^{p^t}$ corresponds to the Position token of the $j$th keypoint of $p^t$:

$$\{\rho_1^{p^t}, \rho_2^{p^t}, \ldots, \rho_{|K|}^{p^t}, \rho_1^{p^{t-\delta}}, \rho_2^{p^{t-\delta}}, \ldots, \rho_{|K|}^{p^{t-\delta}}\} \quad (1)$$

Type Token: The Type token corresponds to the unique type of the keypoint: e.g. the head, left shoulder, right ankle, etc. The Type tokens fall in the range $[1, |K|]$. These add information about the orientation of the pose and are crucial for achieving high accuracy at low resolution, when keypoints have similar spatial locations. A general expression for the Type tokens of poses $p^t$ and $p^{t-\delta}$ is below, where $j^{p^t}$ corresponds to the Type token of the $j$th keypoint of $p^t$:

$$\{1^{p^t}, 2^{p^t}, \ldots, |K|^{p^t}, 1^{p^{t-\delta}}, 2^{p^{t-\delta}}, \ldots, |K|^{p^{t-\delta}}\} \quad (2)$$

Segment Token: The Segment token indicates the number of timesteps the pose is from the current one. The Segment token is in the range $[1, \delta]$, where $\delta$ is a chosen constant. (We set $\delta$ to be 4.) This also allows our method to adapt
to irregular frame rates. Or, if a person is not detected in a frame, we can look back two timesteps, conditioning our model on a temporal token value of 2 instead of 1.

$$\{1^{p^t}, 1^{p^t}, \ldots, 1^{p^t}, \delta^{p^{t-\delta}}, \delta^{p^{t-\delta}}, \ldots, \delta^{p^{t-\delta}}\} \quad (3)$$

After each token is embedded, we sum the embeddings, $E_{sum} = E_{Position} + E_{Type} + E_{Segment}$, to combine the information from each class of token. This is fed to our Transformer Matching Network.
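To make the scheme concrete, here is a minimal sketch of tokenizing a pose pair and summing the three embeddings, written against PyTorch. The vocabulary sizes follow the ranges stated above ($|K| = 15$, a downsampled $w_F \times h_F = 24 \times 18$ position grid, $\delta = 4$) and the hidden size matches the 128 reported later; the module and variable names are illustrative, not the authors' code.

```python
# Sketch of the token embedding scheme (Eqs. 1-3) and the sum
# E_sum = E_Position + E_Type + E_Segment. Sizes follow the text
# (|K| = 15, 24x18 position grid, delta = 4, hidden size 128); names are illustrative.
import torch
import torch.nn as nn

K, W, H, DELTA, HIDDEN = 15, 24, 18, 4, 128

class PosePairEmbedding(nn.Module):
    def __init__(self):
        super().__init__()
        self.pos_emb = nn.Embedding(W * H + 1, HIDDEN)   # Position tokens in [1, w_F * h_F]
        self.type_emb = nn.Embedding(K + 1, HIDDEN)      # Type tokens in [1, |K|]
        self.seg_emb = nn.Embedding(DELTA + 1, HIDDEN)   # Segment tokens in [1, delta]

    def forward(self, kp_cur, kp_prev, delta):
        """kp_cur, kp_prev: (K, 2) integer keypoint (x, y) coords on the 24x18 grid."""
        def pos_tokens(kp):  # flatten (x, y) into a single index in [1, W*H]
            return kp[:, 1] * W + kp[:, 0] + 1
        pos = torch.cat([pos_tokens(kp_cur), pos_tokens(kp_prev)])          # Eq. (1)
        typ = torch.arange(1, K + 1).repeat(2)                              # Eq. (2)
        seg = torch.cat([torch.full((K,), 1), torch.full((K,), delta)])     # Eq. (3)
        return self.pos_emb(pos) + self.type_emb(typ) + self.seg_emb(seg)   # E_sum: (2|K|, HIDDEN)
```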
Transformer Matching Network: The goal of our network is to learn motion cues indicative of whether a pose pair matches. The self-attention mechanism of transformers allows us to accomplish this by learning which temporal relationships between the keypoints are representative of a match. Transformers compute scaled dot-product attention over a set of Queries ($Q$), Keys ($K$), and Values ($V$), each of which is a linear projection of the input $E_{sum} \in \mathbb{R}^{2|K| \times H}$. We compute the softmax attention with respect to every keypoint embedding in the pair, with the input to the softmax operation being of dimensions $[2|K|, 2|K|]$. In fact, we can generate heatmaps from the attention distribution over the pair's keypoints, as displayed in 5.3. In practice, we use multi-headed attention, which leads to the heads specializing, as is also visualized there.

Additionally, we use an attention mask to account for keypoints which are not visible due to occlusion. This attention mask is implemented exactly as the attention mask in [49], resulting in no attention being paid to the keypoints which are not visible due to occlusion. The attention equation is as follows, and we detail each operation in a single transformer in Table 5 of the Supplement:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (4)$$

After computing self-attention through a series of stacked transformers, similar to BERT, we feed this representation to a Pooler, which "pools" the input by selecting the first token in the sequence and then inputting that token into a learned linear projection. This is fed to another linear layer, functioning as a binary classifier, which outputs the likelihood that the two given poses match. We govern training with a binary cross-entropy loss, providing our network only with the supervision of whether the pose pair is a match. See Figure 3 for more details.
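The masked attention of Eq. (4) and the Pooler-plus-classifier head map directly onto a few lines of standard code. The sketch below is a schematic single-head version using the hidden size reported later (128); it is illustrative, not the paper's multi-headed implementation, and the tanh in the pooler is an assumption borrowed from BERT.

```python
# Schematic single-head version of the masked attention in Eq. (4) plus the
# Pooler and binary classifier described above. Illustrative only; the paper
# uses multi-headed attention inside stacked transformer blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchHead(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(hidden, hidden) for _ in range(3))
        self.pooler = nn.Linear(hidden, hidden)   # "pools" by projecting the first token
        self.classifier = nn.Linear(hidden, 1)    # match / no-match logit

    def forward(self, e_sum, visible):
        """e_sum: (2K, hidden) summed embeddings; visible: (2K,) bool keypoint mask."""
        q, k, v = self.q(e_sum), self.k(e_sum), self.v(e_sum)
        scores = q @ k.t() / k.shape[-1] ** 0.5                  # (2K, 2K) as in Eq. (4)
        scores = scores.masked_fill(~visible, float("-inf"))     # no attention to occluded keypoints
        ctx = F.softmax(scores, dim=-1) @ v
        pooled = torch.tanh(self.pooler(ctx[0]))                 # first token -> learned projection
        return torch.sigmoid(self.classifier(pooled))            # likelihood the pair matches
```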
3.3. Improved Multi-Frame Pose Estimation

We now describe how we improve keypoint estimation. Top-down methods suffer from two primary classes of errors from the object detector: 1. missed bounding boxes, and 2. imperfect bounding boxes. We use the box detections from adjacent timesteps in addition to the one in the current timestep to make pose predictions, thereby combating these issues. This is based on the intuition that the spatial location of each person does not change dramatically from frame to frame when the frame rate is relatively high, typical of most modern datasets and cameras. Thus, pasting a bounding box for the $i$th person in frame $F^{t-1}$, $p^{t-1,i}$, in its same spatial location in frame $F^t$ is a good approximation of the true bounding box for person $p^{t,i}$. Bounding boxes are enlarged by a small factor to account for changes in spatial location from frame to frame. Previous approaches, such as [54], use standard non-maximal suppression (NMS) to choose which of these boxes to input into the estimator. Though this addresses the 1st issue of missed boxes, it does not fully address the second issue. NMS relies on the confidence score of the boxes. We make pose predictions for the box in the current frame and temporally adjacent boxes. Then we use object-keypoint similarity (OKS) to determine which of the poses should be kept. This is more accurate than using NMS because we use the confidence scores of the keypoints, not the bounding boxes. The steps of TOKS are enumerated below:

Algorithm 1 Temporal OKS
Input: $p^{t-1}$, $p^t$, $F^t$
1. Retrieve bounding box, $B$, enclosing $p^{t-1}$, and dilate by a factor, $\alpha$
2. Estimate a new pose, $p'^t$, in $F^t$ from $B$
3. Use OKS to determine which pose to keep: $p^* = \mathrm{OKS}(p'^t, p^t)$
Output: $p^*$
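Below is a sketch of how the TOKS selection could be carried out, under one plausible reading of Algorithm 1: the pose estimated from the dilated, propagated box is compared via OKS to the pose already detected in the current frame, and when both describe the same person the candidate with higher mean keypoint confidence is kept. The OKS formula follows the standard COCO definition; `estimate_pose`, the dilation factor `alpha`, the threshold, and the per-keypoint constants `KAPPA` are illustrative placeholders, not the paper's exact values.

```python
# Sketch of Temporal OKS (Algorithm 1), under one plausible reading; all
# constants and the `estimate_pose` callable are placeholders.
import numpy as np

KAPPA = np.full(15, 0.07)   # COCO-style per-keypoint falloff constants (placeholder values)

def oks(kp_a, kp_b, area):
    """Object-keypoint similarity between two (15, 2) keypoint arrays."""
    d2 = np.sum((kp_a - kp_b) ** 2, axis=1)
    return float(np.mean(np.exp(-d2 / (2 * area * KAPPA ** 2))))

def bounding_box(kp):
    (x0, y0), (x1, y1) = kp.min(axis=0), kp.max(axis=0)
    return np.array([x0, y0, x1, y1])

def dilate(box, alpha):
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    w, h = (box[2] - box[0]) * alpha, (box[3] - box[1]) * alpha
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def toks(prev_pose, cur_pose, cur_frame, estimate_pose, alpha=1.25, thresh=0.3):
    box = dilate(bounding_box(prev_pose["kp"]), alpha)        # step 1
    new_pose = estimate_pose(cur_frame, box)                  # step 2
    if cur_pose is None:                                      # a missed detection is recovered
        return new_pose
    area = (box[2] - box[0]) * (box[3] - box[1])
    if oks(new_pose["kp"], cur_pose["kp"], area) > thresh:    # step 3: same person found twice
        return max(new_pose, cur_pose, key=lambda p: p["conf"].mean())
    return cur_pose
```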
4. Experiments

4.1. The PoseTrack Dataset

The PoseTrack 2017 training, validation, and test sets consist of 250, 50, and 208 videos, respectively. Annotations for the test set are held out. We evaluate on the PoseTrack 17 Test set because the PoseTrack 18 Test set has yet to be released. We use the official evaluation server on the test set, which can be submitted to up to 4 times [4, 1]. We conduct the rest of our comparisons on the PoseTrack ECCV 2018 Challenge Validation Set, a superset of PoseTrack 17 with 550 training, 74 validation, and 375 test videos [2].

Metrics  Per-joint Average Precision (AP) is used to evaluate keypoint estimation, based on the formulation in [6]. Multi-Object Tracking Accuracy (MOTA [7, 33]) scores tracking. It penalizes False Negatives (FN), False Positives (FP), and ID Switches (IDSW) under the following formulation for each keypoint $k_i$, where $t$ is the current timestep. Our final MOTA is the average over all keypoints $k_i \in K$:

$$1 - \frac{\sum_t \left(FN^i_t + FP^i_t + IDSW^i_t\right)}{\sum_t GT^i_t}$$
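For concreteness, the formula above reduces to the following small computation over per-frame counts; the counts in the usage line are illustrative only, not results from the paper.

```python
# MOTA as defined above: computed per keypoint type, then averaged over all
# |K| keypoint types. The example counts below are illustrative only.
def mota_per_keypoint(fn, fp, idsw, gt):
    """fn, fp, idsw, gt: per-frame count lists for a single keypoint type."""
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)

def mota(per_keypoint_counts):
    scores = [mota_per_keypoint(*counts) for counts in per_keypoint_counts]
    return sum(scores) / len(scores)

# One keypoint type over three frames: 1 - (3 + 2 + 1) / 90 = 0.933...
print(mota_per_keypoint(fn=[2, 1, 0], fp=[1, 0, 1], idsw=[0, 1, 0], gt=[30, 31, 29]))
```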
Tracking Method | Detection Method | AP↑ (Total) | %IDSW↓ Head | Shou | Elb | Wri | Hip | Knee | Ankl | Total | MOTA↑ (Total)
Pose Entailment | GT Boxes, GT Keypoints | 100 | 0.7 | 0.7 | 0.6 | 0.6 | 0.6 | 0.7 | 0.7 | 0.7 | 99.3
GCN | GT Boxes, GT Keypoints | 100 | 1.4 | 1.4 | 1.4 | 1.5 | 1.4 | 1.6 | 1.6 | 1.5 | 98.5
Optical Flow | GT Boxes, GT Keypoints | 100 | 1.1 | 1.2 | 1.2 | 1.2 | 1.2 | 1.3 | 1.4 | 1.2 | 98.7
Pose Entailment | GT Boxes, Predicted Keypoints | 86.7 | 0.9 | 0.9 | 0.8 | 0.8 | 0.7 | 0.8 | 0.8 | 0.8 | 72.2
GCN | GT Boxes, Predicted Keypoints | 86.7 | 1.6 | 1.6 | 1.6 | 1.6 | 1.3 | 1.5 | 1.4 | 1.5 | 71.6
Optical Flow | GT Boxes, Predicted Keypoints | 86.7 | 1.2 | 1.2 | 1.2 | 1.1 | 1.0 | 1.1 | 1.1 | 1.1 | 71.8
Pose Entailment | Predicted Boxes, Predicted Keypoints | 81.6 | 0.9 | 1.0 | 0.9 | 0.8 | 0.7 | 0.8 | 0.8 | 0.8 | 66.6
GCN | Predicted Boxes, Predicted Keypoints | 81.6 | 1.7 | 1.7 | 1.7 | 1.7 | 1.4 | 1.5 | 1.4 | 1.6 | 65.9
Optical Flow | Predicted Boxes, Predicted Keypoints | 81.6 | 1.3 | 1.2 | 1.2 | 1.2 | 1.1 | 1.1 | 1.1 | 1.1 | 66.3

Figure 4. Compares accuracy of tracking methods on the PoseTrack 18 Val set, given the same keypoints. GT stands for Ground Truth, "predicted" means a neural net is used. Lower %IDSW is better, higher MOTA is better. "Total" averages all joint scores.
Our approach assigns track ids and estimates keypoints independently. This is also true of competing methods with MOTA scores closest to ours. In light of this, we use the same keypoint estimations to compare Pose Entailment to competing tracking methods in 4.2. This makes the IDSW the only component of the MOTA metric that changes, and we calculate $\%IDSW^i = \sum_t IDSW^i_t / \sum_t GT^i_t$. In 4.3, we compare our estimation method to others without evaluating tracking. Finally, in 4.4, we compare our entire tracking pipeline to other pipelines.

Detection Method | AP: Head | Shou | Elb | Wri | Hip | Knee | Ankl | Total
GT | 90.2 | 91.4 | 88.7 | 83.6 | 81.4 | 86.1 | 83.7 | 86.7
Det. | 68.8 | 72.8 | 73.1 | 68.4 | 68.0 | 72.4 | 69.8 | 70.4
Det. + Box Prop. | 79.3 | 82.0 | 80.8 | 75.6 | 72.4 | 76.5 | 72.4 | 77.1
Det. + [email protected] | 83.6 | 86.6 | 84.9 | 78.9 | 76.4 | 80.2 | 76.2 | 81.1
Det. + [email protected] (ours) | 84.1 | 87.2 | 85.3 | 79.2 | 77.1 | 80.6 | 76.5 | 81.6
Det. + [email protected] | 83.9 | 87.2 | 85.2 | 79.1 | 77.1 | 80.7 | 76.4 | 81.5

Table 2. Per-joint AP when the pose estimator is conditioned on different boxes. GT indicates ground truth boxes are used, and serves as an upper bound for accuracy. Det. indicates a detector was used to estimate boxes. @OKS* is the OKS threshold used.
PoseTrack 2018 ECCV Challenge Val Set
No. | Method | Extra Data | AP_T | AP | FPS | MOTA
1 | KeyTrack (ours) | ✗ | 74.3 | 81.6 | 1.0 | 66.6
2 | MIPAL [25] | ✗ | 74.6 | - | - | 65.7
3 | LightTrack (offline) [35] | ✗ | 71.2 | 77.3 | E | 64.9
4 | LightTrack (online) [35] | ✗ | 72.4 | 77.2 | 0.7 | 64.6
5 | Miracle [58] | ✓ | - | 80.9 | E | 64.0
6 | OpenSVAI [36] | ✗ | 69.7 | 76.3 | - | 62.4
7 | STAF [38] | ✓ | 70.4 | - | 3 | 60.9
8 | MDPN [20] | ✓ | 71.7 | 75.0 | E | 50.6

PoseTrack 2017 Test Set Leaderboard
No. | Method | Extra Data | AP_T | FPS | MOTA
1 | KeyTrack (ours) | ✗ | 74.0 | 1.0 | 61.2
2 | POINet [40] | ✗ | 72.5 | - | 58.4
3 | LightTrack [35] | ✗ | 66.7 | E | 58.0
4 | HRNet [45] | ✗ | 75.0 | 0.2 | 57.9
5 | FlowTrack [54] | ✗ | 74.6 | 0.2 | 57.8
6 | MIPAL [25] | ✗ | 68.8 | - | 54.5
7 | STAF [38] | ✓ | 70.3 | 2 | 53.8
8 | JointFlow [15] | ✗ | 63.6 | 0.2 | 53.1

Figure 5. Top scores on the PoseTrack leaderboards. E indicates an ensemble of detectors is used, and results in the method being offline. A check indicates external training data is used beyond COCO and PoseTrack. A "-" indicates the information has not been made publicly available. FPS calculations for JointFlow and FlowTrack are taken from [59]. HRNet FPS is approximated from FlowTrack since the methods are very similar. The AP column has the best AP score. AP_T is the AP score after tracking post-processing.
Figure 6. Qualitative results of KeyTrack on the PoseTrack 17 Test Set. Additional qualitative results are in the supplement.
Efficiency: Our tracking approach is efficient, not reliant on optical flow or RGB data. When processing an image at our optimal resolution, 24x18, we reduce the GFLOPS required by optical flow, which processes images at full size, from 52.7 to 0.1. [35]'s GCN does not capture higher-order interactions over keypoints and can be more efficient than our network with local convolutions. However, this translates to a ~1ms improvement in GPU runtime. In fact, our tracking pipeline demonstrates a 30% improvement in end-to-end runtime over [35], shown in 4.4. We have the fastest FPS of top-down approaches. Also, we do not rely on optical flow to improve bounding box propagation as [54, 45] do; instead we use TOKS. This contributes to our 5x FPS improvement over [54, 45]. Further details on the parameters and FLOPS of the GCN, Optical Flow Network, and our Transformer Matching Network are in Table 6 of the Supplement.

Abs. Position | Type | Segment | Rel. Position | Match % Accuracy
✓ | ✓ | ✗ | ✗ | 72.6
✓ | ✗ | ✓ | ✗ | 90.0
✓ | ✓ | ✓ | ✗ | 93.2 (ours)
✗ | ✓ | ✓ | ✓ | 91.3
✓ | ✓ | ✓ | ✓ | 92.0

Table 3. Match accuracies for various embedding schemes.

5. Analysis

5.1. Tracking Pipeline

Varying Tokenization Schemes and Transformer Hyper-parameters  We examine the benefits of each embedding. As evident in Table 3, Segment embeddings are crucial because they enable the network to distinguish between the poses being matched. Type embeddings give the network information about the orientation of a pose and help it interpret keypoints which are in close spatial proximity, i.e. keypoints that have the same or similar Position embedding. We also train a model that uses the relative keypoint distance from the pose center rather than the absolute distance of the keypoint in the entire image. We find that match accuracy deteriorates with this embedding. This is likely because many people perform the same activity, such as running, in the PoseTrack dataset, leading to them having nearly identical relative pose positions. We vary the number of transformer blocks, the hidden size in the transformer block, and the number of heads in Table 7. Decreasing the number of transformer blocks, the hidden size, and attention heads hurts performance.

Number of Timesteps and Other Factors  We find that reducing the number of timesteps adversely affects the MOTA score. It drops up to 0.3 points when using only a single timestep because we are less robust to detection errors. Also, as a replacement for our greedy algorithm, we experimented with the Hungarian algorithm used in [19]. This algorithm is effective with ground truth information, but is not accurate when using detected poses.
Num Tx | Hidden Size | Int. Size | Num Heads | Parameters (M) | % IDSW
2 | 128 | 512 | 4 | 0.40 | 1.0
4 | 128 | 512 | 4 | 0.43 | 0.8
6 | 128 | 512 | 4 | 1.26 | 1.1
4 | 64 | 256 | 4 | 0.23 | 0.9
4 | 128 | 512 | 4 | 0.43 | 0.8
4 | 256 | 1024 | 4 | 3.31 | 1.1
4 | 128 | 128 | 4 | 0.43 | 0.8
4 | 128 | 512 | 4 | 0.86 | 0.8
4 | 128 | 128 | 2 | 0.43 | 0.9
4 | 128 | 128 | 4 | 0.43 | 0.8
4 | 128 | 128 | 6 | 0.43 | 0.8
Figure 7. Left: Transformer network hyper-parameters are varied. Right: A plot of IDSW rate vs. image resolution. The table on the left shows the input to each method; the conv+visual input is blurry because images are downsampled.
References

[1] PoseTrack leaderboard, 2017 test set, 2017.
[2] PoseTrack challenge - ECCV 2018, 2018.
[3] Abdulrahman Alarifi, AbdulMalik Al-Salman, Mansour Alsaleh, Ahmad Alnafessah, Suheer Al-Hadhrami, Mai A Al-Ammar, and Hend S Al-Khalifa. Ultra wideband indoor positioning technologies: Analysis and recent advances. Sensors, 16(5):707, 2016.
[4] Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, and Bernt Schiele. PoseTrack: A benchmark for human pose estimation and tracking. In CVPR, 2018.
[5] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1014–1021. IEEE, 2009.
[6] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2D human pose estimation - MPII human pose dataset. In CVPR, 2014.
[7] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. Journal on Image and Video Processing, 2008:1, 2008.
[8] Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, and Lorenzo Torresani. Learning temporal pose estimation from sparsely-labeled videos. arXiv preprint arXiv:1906.04016, 2019.
[9] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
[10] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7103–7112, 2018.
[11] Bo Dai, Yuqi Zhang, and Dahua Lin. Detecting visual relationships with deep relational networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[12] Matthias Dantone, Juergen Gall, Christian Leistner, and Luc Van Gool. Human pose estimation using body parts dependent joint regressors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3041–3048, 2013.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[15] Andreas Doering, Umar Iqbal, and Juergen Gall. JointFlow: Temporal flow fields for multi person tracking, 2018.
[16] Pedro F Felzenszwalb and Daniel P Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 61(1):55–79, 2005.
[17] Mihai Fieraru, Anna Khoreva, Leonid Pishchulin, and Bernt Schiele. Learning to refine human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 205–214, 2018.
[18] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 244–253, 2019.
[19] Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran. Detect-and-track: Efficient pose estimation in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 350–359, 2018.
[20] Hengkai Guo, Tang Tang, Guozhong Luo, Riwei Chen, Yongchen Lu, and Linfu Wen. Multi-domain pose network for multi-person pose estimation and tracking. Computer Vision – ECCV 2018 Workshops, pages 209–216, 2019.
[21] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[22] Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[23] Hao Huang, Luowei Zhou, Wei Zhang, Jason J Corso, and Chenliang Xu. Dynamic graph modules for modeling object-object interactions in activity recognition: Supplementary material. BMVC, 2019.
[24] Shaoli Huang, Mingming Gong, and Dacheng Tao. A coarse-fine network for keypoint localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 3028–3037, 2017.
[25] Jihye Hwang, Jieun Lee, Sungheon Park, and Nojun Kwak. Pose estimator and tracker using temporal flow maps for limbs, 2019.
[26] Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu Tang, Evgeny Levinkov, Bjoern Andres, and Bernt Schiele. ArtTrack: Articulated multi-person tracking in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6457–6465, 2017.
[27] Umar Iqbal, Anton Milan, and Juergen Gall. PoseTrack: Joint multi-person pose estimation and tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2011–2020, 2017.
[28] Sheng Jin, Wentao Liu, Wanli Ouyang, and Chen Qian. Multi-person articulated tracking with spatial and temporal embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5664–5673, 2019.
[29] Sheng Jin, Xujie Ma, Zhipeng Han, Yue Wu, Wei Yang, Wentao Liu, Chen Qian, and Wanli Ouyang. Towards multi-person pose tracking: Bottom-up and top-down methods. In ICCV PoseTrack Workshop, volume 2, page 7, 2017.
[30] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[31] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
[32] Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, and Hans Peter Graf. Attend and interact: Higher-order object interactions for video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6790–6800, 2018.
[33] Anton Milan, Laura Leal-Taixé, Ian D. Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. CoRR, abs/1603.00831, 2016.
[34] Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. PoseFix: Model-agnostic general human pose refinement network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7773–7781, 2019.
[35] Guanghan Ning and Heng Huang. LightTrack: A generic framework for online top-down human pose tracking. arXiv preprint arXiv:1905.02822, 2019.
[36] Guanghan Ning, Ping Liu, Xiaochuan Fan, and Chi Zhang. A top-down approach to articulated human pose estimation and tracking. Computer Vision – ECCV 2018 Workshops, pages 227–234, 2019.
[37] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4903–4911, 2017.
[38] Yaadhav Raaj, Haroon Idrees, Gines Hidalgo, and Yaser Sheikh. Efficient online multi-person 2D pose tracking with recurrent spatio-temporal affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4620–4628, 2019.
[39] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019.
[40] Weijian Ruan, Wu Liu, Qian Bao, Jun Chen, Yuhao Cheng, and Tao Mei. POINet: Pose-guided ovonic insight network for multi-person pose tracking. In Proceedings of the 27th ACM International Conference on Multimedia, MM '19, pages 284–292, New York, NY, USA, 2019. ACM.
[41] Adam Santoro, David Raposo, David GT Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, 2017.
[42] Leonid Sigal and Michael J Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 2041–2048. IEEE, 2006.
[43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.
[44] Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743, 2019.
[45] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. CoRR, abs/1902.09212, 2019.
[46] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.
[47] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656, 2015.
[48] Alexander Toshev and Christian Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017.
[50] Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2566–2576, 2019.
[51] Yang Wang and Greg Mori. Multiple tree models for occlusion and spatial constraints in human pose estimation. In European Conference on Computer Vision, pages 710–724. Springer, 2008.
[52] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
[53] Fangting Xia, Peng Wang, Xianjie Chen, and Alan L Yuille. Joint multi-person pose estimation and semantic part segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6769–6778, 2017.
[54] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision (ECCV), 2018.
[55] Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, and Cewu Lu. Pose Flow: Efficient online pose tracking. arXiv preprint arXiv:1802.00977, 2018.
[56] Jiarui Xu, Yue Cao, Zheng Zhang, and Han Hu. Spatial-temporal relation networks for multi-object tracking. arXiv preprint arXiv:1904.11489, 2019.
[57] Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Learning feature pyramids for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1281–1290, 2017.
[58] Dongdong Yu, Kai Su, Jia Sun, and Changhu Wang. Multi-person pose estimation for pose tracking with enhanced cascaded pyramid network. In European Conference on Computer Vision, pages 221–226. Springer, 2018.
[59] Jiabin Zhang, Zheng Zhu, Wei Zou, Peng Li, Yanwei Li, Hu Su, and Guan Huang. FastPose: Towards real-time pose estimation and tracking via scale-normalized multi-task networks, 2019.
[60] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and VQA. arXiv preprint arXiv:1909.11059, 2019.
[61] Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8739–8748, 2018.