
15 Keypoints Is All You Need

Michael Snower* (Brown University)    Asim Kadav, Farley Lai, Hans Peter Graf (NEC Labs America)
michael [email protected]    {asim,farleylai,hpg}@nec-labs.com

Abstract

Pose tracking is an important problem that requires identifying unique human pose-instances and matching them temporally across different frames of a video. However, existing pose tracking methods are unable to accurately model temporal relationships and require significant computation, often computing the tracks offline. We present an efficient multi-person pose tracking method, KeyTrack, that relies only on keypoint information, without using any RGB or optical flow information, to track human keypoints in real-time. Keypoints are tracked using our Pose Entailment method, in which, first, a pair of pose estimates is sampled from different frames in a video and tokenized. Then, a Transformer-based network makes a binary classification as to whether one pose temporally follows another. Furthermore, we improve our top-down pose estimation method with a novel, parameter-free keypoint refinement technique that improves the keypoint estimates used during the Pose Entailment step. We achieve state-of-the-art results on the PoseTrack'17 and PoseTrack'18 benchmarks while using only a fraction of the computation required by most other methods for computing the tracking information.

Figure 1. They look alike, so how do we decide who's who? In the Pose Entailment framework, given a video frame, we track individuals by comparing pairs of poses, using temporal motion cues to determine who's who. Using a novel tokenization scheme to create pose pair inputs interpretable by Transformers [49], our network divides its attention equally between both poses in matching pairs, and focuses more on a single pose in non-matching pairs because motion cues between keypoints are not present. We visualize this above; bright red keypoints correspond to high attention.

1. Introduction

Multi-person pose tracking is an important problem for human action recognition and video understanding. It occurs in two steps: first, estimation, where keypoints of individual persons are localized; second, the tracking step, where each keypoint is assigned to a unique person. Pose tracking methods rely on deep convolutional neural networks for the first step [48, 47, 57, 52], but approaches in the second step vary. This is a challenging problem because tracks must be created for each unique person while overcoming occlusion and complex motion. Moreover, individuals may appear visually similar because they are wearing the same uniform. It is also important for tracking to be performed online. Commonly used methods, such as optical flow and graph convolutional networks (GCNs), are effective at modeling spatio-temporal keypoint relationships [45, 35], but are dependent on high spatial resolution, making them computationally costly. Non-learning based methods, such as spatial consistency, are faster than the convolution-based methods, but are not as accurate.

* Work done as a NEC Labs intern.
To address the above limitations, we propose an efficient pose tracking method, KeyTrack, that leverages temporal relationships to improve multi-person pose estimation and tracking. KeyTrack follows the tracking-by-detection approach: it first localizes humans, estimates human pose keypoints, and then encodes the keypoint information in a novel entailment setting using transformer building blocks [49]. Similar to the textual entailment task, where one has to predict whether one sentence follows another, we propose the Pose Entailment task, where the model learns to make a binary classification of whether two keypoint poses temporally follow, or entail, each other. Hence, rather than extracting information from a high-dimensional image representation using deep CNNs, we extract information from a sentence of 15 tokens, where each token corresponds to a keypoint on a pose. Similar to how BERT tokenizes words [14], we propose an embedding scheme for pose data that captures spatio-temporal relationships, and we feed our transformer network these embeddings. Since these embeddings contain information beyond spatial location, our network outperforms convolution-based approaches in terms of accuracy and speed, particularly at very low resolutions.

Additionally, in order to improve the keypoint estimates used by the transformer network, we propose a Temporal Object Keypoint Similarity (TOKS) method. TOKS refines the pose estimation output by augmenting missed detections and thresholding low-quality estimates using a keypoint similarity metric. TOKS adds no learned parameters to the estimation step, and it is superior to existing bounding box propagation methods that often rely on NMS and optical flow. KeyTrack makes the following contributions:

1. KeyTrack introduces Pose Entailment, where a binary classification is made as to whether two poses from different timesteps are the same person. We model this task with a transformer-based network which learns temporal pose relationships even in datasets with complex motion. Furthermore, we present a tokenization scheme for pose information that allows transformers to outperform convolutions at low spatial resolutions when tracking keypoints.

2. KeyTrack introduces a temporal method for improving keypoint estimates. TOKS is more accurate than bounding box propagation, faster than a detector ensemble, and does not require learned parameters.

Using the above methods, we develop an efficient multi-person pose tracking pipeline which sets a new SOTA on the PoseTrack test set. We achieve 61.2% tracking accuracy on the PoseTrack'17 Test set and 66.6% on the PoseTrack'18 Val set using a model that consists of just 0.43M parameters in the tracking step. This portion of our pipeline is 500x more efficient than the leading optical flow method [45]. Our training is performed on a single NVIDIA 1080Ti GPU. Because it is not reliant on RGB or optical flow information in the tracking step, our model is suitable for pose tracking with other, non-visual pose estimation sensors that provide only 15 keypoints for each person [3].

Method                  Estimation          Detection Improvement   Tracking
Top-down:
  Ours                  HRNet               Temporal OKS            Pose Entailment
  HRNet [45]            HRNet               BBox Prop.              Optical Flow
  POINet [40]           VGG, T-VGG          -                       Ovonic Insight Net
  MDPN [20]             MDPN                Ensemble                Optical Flow
  LightTrack [35]       Simple Baselines    Ensemble/BBox Prop.     GCN
  ProTracker [19]       3D Mask RCNN        -                       IoU
Bottom-up:
  Affinity Fields [38]  VGG/STFields        -                       STFields
  STEmbeddings [28]     STEmbeddings        -                       STEmbeddings
  JointFlow             Siamese CNN         -                       Flow Fields

Table 1. How different approaches address each step of the Pose Tracking problem. Our contributions are in bold.

2. Related Work

We are inspired by related work on pose estimation and tracking methods, and by recent work on applying the transformer network to video understanding.

Pose estimation. Early work on pose estimation uses graphical models to learn spatial correlations and interactions between various joints [5, 16]. These models often perform poorly due to occlusions and long-range temporal relationships, which need to be explicitly modeled [12, 42, 51]. More recent work uses convolutional neural networks (CNNs) to directly regress cartesian coordinates of the joints [48] or to generate heatmaps of the probability of a joint's location [47, 57, 52]. A majority of the convolutional approaches can be classified into top-down and bottom-up methods: top-down methods use a separate detection step to identify person candidates [21, 37, 10, 24], and the single-person pose estimation step is then performed on these candidates. Bottom-up methods calculate keypoints from all candidates and then correlate these keypoints into individual human joints [53, 25]. The latter is more efficient, since all keypoints are calculated in a single step; however, the former is more accurate, since the object detection step limits the regression boundaries. Top-down methods work poorly on small objects, and recent work (HRNet) [45] uses parallel networks at different resolutions to maximize spatial information. PoseWarper [8] uses a pair of labeled and unlabeled frames to predict human pose by learning the pose warping with deformable convolutions. Finally, since the earliest applications of deep learning to pose estimation [48], iterative predictions have improved accuracy. Pose estimation has been shown to benefit from cascaded predictions [10], and pose-refinement methods [17, 34] refine the pose estimation results of previous stages using a separate post-processing network. In that spirit, our work, KeyTrack, relies on HRNet to generate keypoints and refines keypoint estimates by temporally aggregating and suppressing low-confidence keypoints with TOKS instead of the commonly used bounding box propagation approaches.

Pose tracking methods. Pose tracking methods assign unique IDs to individual keypoints, estimated with the techniques described in the previous subsection, to track them through time [4, 26, 27, 1]. Some methods perform tracking by learning spatio-temporal pose relationships across video frames using convolutions [50, 40, 35]. [40] predicts track ids in an end-to-end fashion with embedded visual features from its estimation step, making predictions in multiple temporal directions. [35] uses a GCN to track poses based on spatio-temporal keypoint relationships. These networks require high spatial resolutions. In contrast, we create keypoint embeddings from the keypoint's spatial location and other information, making our network less reliant on spatial resolution, and thus more efficient. We can also model more fine-grained spatio-temporal relationships.
Figure 2. a) Keypoints are estimated with HRNet. b) TOKS improves detection accuracy. c) Pose pairs are collected from multiple past
timesteps. Poses of the same color have the same track id, the color black indicates the track id is unknown. d) Each pair is tokenized
independently from the other pairs. e) Our Transformer Matching Network calculates match scores independently for each pair. f) The
maximum match score is greedily chosen and the corresponding track id is assigned.

Among non-learned tracking methods, optical flow propagates poses from one frame to the next to determine which pose they are most similar to in the next frame [45, 20]. This improves over spatial consistency, which measures the IoU between bounding boxes of poses from temporally adjacent frames [19]. Other methods use graph-partitioning based approaches to group pose tracks [26, 27, 29]. Another method, PoseFlow [55], uses inter-/intra-frame pose distance and NMS to construct pose flows. However, these approaches, unlike our method, require hard-coded parameters during inference; this limits the ability of non-learned methods to model scenes with complex motion and requires time-intensive manual tuning. Table 1 shows top-down methods similar to our work as well as competitive bottom-up methods.

Transformer models. Recently, there have been successful implementations of transformer-based models for image and video input modalities, often substituting convolutions and recurrence mechanisms. These methods can efficiently model higher-order relationships between various scene elements, unlike pair-wise methods [11, 22, 41, 56]. They have been applied to image classification [39], visual question answering [30, 31, 46, 60], action recognition [23, 32], video captioning [44, 61] and other video problems. Video-Action Transformer [18] solves the action localization problem using transformers by learning the context and interactions for every person in the video. BERT [13] uses transformers by pretraining a transformer-based network in a multi-task transfer learning scheme over the unsupervised tasks of predicting missing words or next sentences. Instead, in a supervised setting, KeyTrack uses transformers to learn spatio-temporal keypoint relationships for the visual problem of pose tracking.

3. Method

3.1. Overview of Our Approach

We now describe the keypoint estimation and tracking approach used in KeyTrack, as shown in Figure 2. For frame $F^t$ at timestep $t$, we wish to assign a track id to the $i$th pose $p^{t,i} \in P^t$. First, each of the pose's $k_j \in K$ keypoints is detected. This is done by localizing a bounding box around each pose with an object detector and then estimating keypoint locations in the box. Keypoint predictions are improved with temporal OKS (TOKS); please see 3.3 for more details. From here, this pose with no tracking id, $p^{t,i}_{\varnothing}$, is assigned its appropriate one. This is based on the pose's similarity to a pose in a previous timestep, which has an id, $p^{t-\delta,j}_{id}$. Similarity is measured with the match score, $m^{t-\delta,j}_{id}$, using Pose Entailment (3.2).

False negatives are an inevitable problem in keypoint detection, and they hurt the downstream tracking step because poses with the correct track id may appear to be no longer in the video. We mitigate this by calculating match scores for poses in not just one previous frame, but multiple frames $\{F^{t-1}, F^{t-2}, \ldots, F^{t-\delta}\}$. Thus, we compare to each pose $p^{t-d,j}_{id}$ where $1 \le d \le \delta$ and $1 \le j \le |P^{t-d}|$. In practice, we limit the number of poses we compare to in a given frame to the $n$ spatially nearest poses. This is just as accurate as comparing to every pose in the frame and bounds our runtime to $O(\delta n)$. This gives us a set of match scores $\mathcal{M}$, and we assign $p^{t,i}_{\varnothing}$ the track id corresponding to the maximum match score $m^* = \max(\mathcal{M})$, where $id^* = m^*_{id}$. Thus, we assign the tracking id to the pose, giving $p^{t,i}_{id^*}$.
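To make the assignment step concrete, the sketch below implements the greedy matching loop described above. It is an illustrative reconstruction, not the authors' code: match_score stands in for the Pose Entailment classifier of Section 3.2, and the candidate filtering and data layout are assumptions.

```python
import numpy as np

def spatially_nearest(frame_poses, pose, n):
    """Return the n (track_id, keypoints) pairs whose mean keypoint
    position is closest to the query pose."""
    center = np.mean(pose, axis=0)
    dists = [np.linalg.norm(np.mean(kps, axis=0) - center)
             for _, kps in frame_poses]
    keep = np.argsort(dists)[:n]
    return [frame_poses[i] for i in keep]

def assign_track_id(pose, prev_frames, match_score, n=3, delta=4):
    """Greedy track assignment for one untracked pose in frame t.

    pose        -- (|K|, 2) keypoint array of the untracked pose
    prev_frames -- up to `delta` previous frames, most recent first; each is a
                   list of (track_id, keypoints) pairs
    match_score -- callable giving the Pose Entailment match probability for a
                   (previous pose, current pose, timestep gap) triple
    n           -- only the n spatially nearest poses per frame are considered,
                   which bounds the cost at O(delta * n)
    """
    best_score, best_id = 0.0, None
    for d, frame_poses in enumerate(prev_frames[:delta], start=1):
        for track_id, prev_pose in spatially_nearest(frame_poses, pose, n):
            score = match_score(prev_pose, pose, d)
            if score > best_score:          # keep the maximum match score m*
                best_score, best_id = score, track_id
    return best_id                          # id* assigned to p^{t,i}
```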
Figure 3. Orange box: Visualizations to intuitively explain our tokenization. In the Position column, the matching poses are spatially
closer together than the non-matching ones. This is because their spatial locations in the image are similar. The axis limit is 432 because
the image has been downsampled to width × height = 432. In the following column, the matching contours are similar, since the poses
are in similar orientations. The Segment axis in the last column represents the temporal distance of the pair. Green box: A series of
transformers (Tx) compute self-attention, extracting the temporal relationship between the pair. Binary classification follows.

3.2. Pose Entailment

To effectively solve the multi-person pose tracking problem, we need to understand how human poses move through time based on spatial joint configurations, as well as in the presence of multiple persons and occluding objects. Hence, we need to learn whether a pose in timestep $t$ can be inferred from timestep $t-1$. Textual entailment provides us with a similar framework in the NLP domain, where one needs to understand if one sentence can be implied from the next. More specifically, the textual entailment model classifies whether a premise sentence implies a hypothesis sentence in a sentence pair [9]. The typical approach to this problem consists of first projecting the pair of sentences to an embedding space and then feeding them through a neural network which outputs a binary classification for the sentence pair.

Hence, we propose the Pose Entailment problem. More formally, we seek to classify whether a pose in timestep $t-\delta$, i.e. the premise, and a pose in timestep $t$, i.e. the hypothesis, are the same person. To solve this problem, instead of using visual feature based similarity that incurs large computational cost, we use the set of human keypoints, $K$, detected by our pose estimator. It is computationally efficient to use these, as there are a limited number of them (in our case $|K| = 15$), and they are not affected by unexpected visual variations such as lighting changes in the tracking step. In addition, as we show in the next section, keypoints are amenable to tokenization. Thus, during the tracking stage, we use only the keypoints estimated by the detector as our pose representation.

Tokenizing Pose Pairs. The goal of tokenization is to transform pose information into a representation that facilitates learning spatio-temporal human pose relationships. To achieve this goal, for each pose token, we need to provide (i) the spatial location of each keypoint in the scene, to allow the network to spatially correlate keypoints across frames, (ii) type information for each keypoint (i.e. head, shoulder, etc.), to learn spatial joint relationships in each human pose, and finally (iii) the temporal location index for each keypoint within a temporal window $\delta$, to learn temporal keypoint transitions. Hence, we use three different types of tokens for each keypoint, as shown in Figure 3. There are 2 poses, and thus $2|K|$ tokens of each type. Each token is linearly projected to an embedding, $E \in \mathbb{R}^{2|K| \times H}$, where $H$ is the transformer hidden size. Embeddings are a learned lookup table. We now describe the individual tokens in detail:

Position Token: The absolute spatial location of each keypoint is the Position token, $\rho$, and its values fall in the range $[1, w_F h_F]$. In practice, the absolute spatial location in a downsampled version of the original frame is used. This not only improves the efficiency of our method, but also makes it more accurate, as is discussed in 5.2. We give a general expression for the Position tokens of poses $p^t$ and $p^{t-\delta}$, where $\rho^{p^t}_j$ corresponds to the Position token of the $j$th keypoint of $p^t$:

$$\{\rho^{p^t}_1, \rho^{p^t}_2, \ldots, \rho^{p^t}_{|K|}, \rho^{p^{t-\delta}}_1, \rho^{p^{t-\delta}}_2, \ldots, \rho^{p^{t-\delta}}_{|K|}\} \quad (1)$$

Type Token: The Type token corresponds to the unique type of the keypoint, e.g. the head, left shoulder, right ankle, etc. The Type tokens fall in the range $[1, |K|]$. These add information about the orientation of the pose and are crucial for achieving high accuracy at low resolution, when keypoints have similar spatial locations. A general expression for the Type tokens of poses $p^t$ and $p^{t-\delta}$ is below, where $j^{p^t}$ corresponds to the Type token of the $j$th keypoint of $p^t$:

$$\{1^{p^t}, 2^{p^t}, \ldots, |K|^{p^t}, 1^{p^{t-\delta}}, 2^{p^{t-\delta}}, \ldots, |K|^{p^{t-\delta}}\} \quad (2)$$

Segment Token: The Segment token indicates the number of timesteps the pose is from the current one. The Segment token is in the range $[1, \delta]$, where $\delta$ is a chosen constant. (We set $\delta$ to be 4.) This also allows our method to adapt to irregular frame rates; if a person is not detected in a frame, we can look back two timesteps, conditioning our model on a temporal token value of 2 instead of 1.

$$\{1^{p^t}, 1^{p^t}, \ldots, 1^{p^t}, \delta^{p^{t-\delta}}, \delta^{p^{t-\delta}}, \ldots, \delta^{p^{t-\delta}}\} \quad (3)$$

After each token is embedded, we sum the embeddings, $E_{sum} = E_{Position} + E_{Type} + E_{Segment}$, to combine the information from each class of token. This is fed to our Transformer Matching Network.
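As a concrete illustration, the following PyTorch sketch builds the three token sequences for a pose pair and sums their embedding lookups into E_sum. The module and helper are ours for illustration and not the authors' implementation; the 24x18 grid, |K| = 15, δ = 4, and hidden size 128 are taken from values reported elsewhere in the paper, and the exact vocabulary layout is an assumption.

```python
import torch
import torch.nn as nn

K = 15            # keypoints per pose
W, H = 24, 18     # assumed downsampled frame size (w_F * h_F = 432)
DELTA = 4         # maximum temporal gap
HIDDEN = 128      # assumed transformer hidden size

class PosePairEmbedding(nn.Module):
    """Sums Position, Type, and Segment embeddings for a tokenized pose pair."""
    def __init__(self):
        super().__init__()
        self.pos_emb = nn.Embedding(W * H + 1, HIDDEN)   # rho in [1, w_F * h_F]
        self.type_emb = nn.Embedding(K + 1, HIDDEN)      # keypoint type in [1, |K|]
        self.seg_emb = nn.Embedding(DELTA + 1, HIDDEN)   # temporal gap in [1, delta]

    def forward(self, pos_tokens, type_tokens, seg_tokens):
        # each input: LongTensor of shape (batch, 2K), one token per keypoint
        return (self.pos_emb(pos_tokens)
                + self.type_emb(type_tokens)
                + self.seg_emb(seg_tokens))              # E_sum: (batch, 2K, HIDDEN)

def tokenize_pair(kps_t, kps_prev, gap):
    """kps_*: (K, 2) keypoint coordinates already downsampled to the W x H grid."""
    def pos_tok(kps):
        x, y = kps[:, 0].long(), kps[:, 1].long()
        return (y * W + x + 1).clamp(1, W * H)           # flatten (x, y) to one index
    pos = torch.cat([pos_tok(kps_t), pos_tok(kps_prev)])
    typ = torch.cat([torch.arange(1, K + 1), torch.arange(1, K + 1)])
    seg = torch.cat([torch.full((K,), 1, dtype=torch.long),
                     torch.full((K,), gap, dtype=torch.long)])
    return pos.unsqueeze(0), typ.unsqueeze(0), seg.unsqueeze(0)
```

A pose pair tokenized this way yields a (1, 2|K|) index tensor per token type, and the summed embedding has shape (1, 2|K|, H), matching the transformer input described next.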
Transformer Matching Network: The goal of our network is to learn motion cues indicative of whether a pose pair matches. The self-attention mechanism of transformers allows us to accomplish this by learning which temporal relationships between the keypoints are representative of a match. Transformers compute scaled dot-product attention over a set of Queries ($Q$), Keys ($K$), and Values ($V$), each of which is a linear projection of the input $E_{sum} \in \mathbb{R}^{2|K| \times H}$. We compute the softmax attention with respect to every keypoint embedding in the pair, with the input to the softmax operation being of dimensions $[2|K|, 2|K|]$. In fact, we can generate heatmaps from the attention distribution over the pair's keypoints, as displayed in 5.3. In practice, we use multi-headed attention, which leads to the heads specializing, as also visualized there.

Additionally, we use an attention mask to account for keypoints which are not visible due to occlusion. This attention mask is implemented exactly as the attention mask in [49], resulting in no attention being paid to the keypoints which are not visible due to occlusion. The attention equation is as follows, and we detail each operation in a single transformer in Table 5 of the Supplement:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (4)$$

After computing self-attention through a series of stacked transformers, similar to BERT, we feed this representation to a Pooler, which "pools" the input by selecting the first token in the sequence and then inputting that token into a learned linear projection. This is fed to another linear layer, functioning as a binary classifier, which outputs the likelihood that the two given poses match. We govern training with a binary cross entropy loss, providing our network only with the supervision of whether the pose pair is a match. See Figure 3 for more details.

3.3. Improved Multi-Frame Pose Estimation

We now describe how we improve keypoint estimation. Top-down methods suffer from two primary classes of errors from the object detector: 1. missed bounding boxes; 2. imperfect bounding boxes. We use the box detections from adjacent timesteps, in addition to the one in the current timestep, to make pose predictions, thereby combating these issues. This is based on the intuition that the spatial location of each person does not change dramatically from frame to frame when the frame rate is relatively high, as is typical in most modern datasets and cameras. Thus, pasting a bounding box for the $i$th person in frame $F^{t-1}$, $p^{t-1,i}$, in its same spatial location in frame $F^t$ is a good approximation of the true bounding box for person $p^{t,i}$. Bounding boxes are enlarged by a small factor to account for changes in spatial location from frame to frame. Previous approaches, such as [54], use standard non-maximal suppression (NMS) to choose which of these boxes to input into the estimator. Though this addresses the first issue of missed boxes, it does not fully address the second issue, since NMS relies on the confidence scores of the boxes. We instead make pose predictions for the box in the current frame and the temporally adjacent boxes, and then use object-keypoint similarity (OKS) to determine which of the poses should be kept. This is more accurate than using NMS because we use the confidence scores of the keypoints, not the bounding boxes. The steps of TOKS are enumerated below:

Algorithm 1 Temporal OKS
Input: $p^{t-1}$, $p^t$, $F^t$
1. Retrieve the bounding box, $B$, enclosing $p^{t-1}$, and dilate it by a factor $\alpha$
2. Estimate a new pose, $p'^t$, in $F^t$ from $B$
3. Use OKS to determine which pose to keep: $p^* = \mathrm{OKS}(p'^t, p^t)$
Output: $p^*$
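A minimal Python sketch of Algorithm 1 follows, using a simplified COCO-style OKS. The dilation factor, the duplicate threshold, the single fall-off constant, and the pose object interface (keypoints, keypoint_scores, box attributes) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def oks(pose_a, pose_b, area, kappa=0.1):
    """Simplified COCO-style object keypoint similarity between two (K, 2) arrays.
    A single fall-off constant `kappa` is assumed instead of per-joint values."""
    d2 = np.sum((pose_a - pose_b) ** 2, axis=1)
    return float(np.mean(np.exp(-d2 / (2.0 * area * kappa ** 2))))

def dilate_box(box, alpha):
    """Enlarge (x1, y1, x2, y2) about its center by a factor alpha."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * alpha, (y2 - y1) * alpha
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def box_area(box):
    x1, y1, x2, y2 = box
    return max(x2 - x1, 1e-6) * max(y2 - y1, 1e-6)

def temporal_oks(prev_pose, cur_pose, frame_t, pose_estimator, alpha=1.25):
    """Sketch of Algorithm 1 (Temporal OKS).

    prev_pose      -- pose p^{t-1}, assumed to expose .keypoints and .box
    cur_pose       -- pose p^t from the detector box in frame t (may be None)
    pose_estimator -- callable mapping (frame, box) -> pose with keypoint scores
    """
    box = dilate_box(prev_pose.box, alpha)          # step 1: propagate and dilate
    new_pose = pose_estimator(frame_t, box)         # step 2: estimate p'^t from B
    if cur_pose is None:                            # recover a missed detection
        return new_pose
    # step 3: if the two candidates describe the same person (high OKS),
    # keep the one whose keypoints are more confident
    if oks(new_pose.keypoints, cur_pose.keypoints, area=box_area(box)) > 0.5:
        return max(new_pose, cur_pose, key=lambda p: p.keypoint_scores.mean())
    return cur_pose
```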
Detection Method                               Tracking Method   %IDSW ↓: Head Shou Elb  Wri  Hip  Knee Ankl | Total   MOTA ↑
GT Boxes, GT Keypoints (AP ↑ 100)              Pose Entailment            0.7  0.7  0.6  0.6  0.6  0.7  0.7  | 0.7     99.3
                                               GCN                        1.4  1.4  1.4  1.5  1.4  1.6  1.6  | 1.5     98.5
                                               Optical Flow               1.1  1.2  1.2  1.2  1.2  1.3  1.4  | 1.2     98.7
GT Boxes, Predicted Keypoints (AP 86.7)        Pose Entailment            0.9  0.9  0.8  0.8  0.7  0.8  0.8  | 0.8     72.2
                                               GCN                        1.6  1.6  1.6  1.6  1.3  1.5  1.4  | 1.5     71.6
                                               Optical Flow               1.2  1.2  1.2  1.1  1.0  1.1  1.1  | 1.1     71.8
Predicted Boxes, Predicted Keypoints (AP 81.6) Pose Entailment            0.9  1.0  0.9  0.8  0.7  0.8  0.8  | 0.8     66.6
                                               GCN                        1.7  1.7  1.7  1.7  1.4  1.5  1.4  | 1.6     65.9
                                               Optical Flow               1.3  1.2  1.2  1.2  1.1  1.1  1.1  | 1.1     66.3

Figure 4. Compares accuracy of tracking methods on the PoseTrack'18 Val set, given the same keypoints. GT stands for Ground Truth; "predicted" means a neural net is used. Lower %IDSW is better; higher MOTA is better. "Total" averages all joint scores.

4. Experiments

4.1. The PoseTrack Dataset

The PoseTrack 2017 training, validation, and test sets consist of 250, 50, and 208 videos, respectively. Annotations for the test set are held out. We evaluate on the PoseTrack'17 Test set because the PoseTrack'18 Test set has yet to be released. We use the official evaluation server on the test set, which can be submitted to up to 4 times [4, 1]. We conduct the rest of the comparisons on the PoseTrack ECCV 2018 Challenge Validation Set, a superset of PoseTrack'17, with 550 training, 74 validation, and 375 test videos [2].

Metrics: Per-joint Average Precision (AP) is used to evaluate keypoint estimation, based on the formulation in [6]. Multi-Object Tracking Accuracy (MOTA) [7, 33] scores tracking. It penalizes False Negatives (FN), False Positives (FP), and ID Switches (IDSW) under the following formulation for each keypoint $k^i$, where $t$ is the current timestep. Our final MOTA is the average over all keypoints $k^i \in K$:

$$1 - \frac{\sum_t \left(FN^i_t + FP^i_t + IDSW^i_t\right)}{\sum_t GT^i_t}$$
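To make the metric concrete, the short sketch below accumulates the per-frame counts for one keypoint type and reports its MOTA together with the ID-switch ratio used in the next subsection. It is an illustration of the formula above, not the official evaluation code.

```python
def mota(per_frame_counts):
    """per_frame_counts: iterable of (FN, FP, IDSW, GT) tuples for one keypoint type.
    Returns (MOTA, %IDSW) for that keypoint; the paper's final MOTA averages this
    quantity over all keypoint types k^i in K."""
    fn = sum(c[0] for c in per_frame_counts)
    fp = sum(c[1] for c in per_frame_counts)
    idsw = sum(c[2] for c in per_frame_counts)
    gt = sum(c[3] for c in per_frame_counts)
    mota_i = 1.0 - (fn + fp + idsw) / gt
    pct_idsw_i = 100.0 * idsw / gt          # ID-switch ratio, reported as a percentage
    return mota_i, pct_idsw_i

# toy example: three frames with 10 ground-truth keypoints each
print(mota([(1, 0, 0, 10), (0, 1, 1, 10), (0, 0, 0, 10)]))  # MOTA 0.9, %IDSW ~3.3
```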

Our approach assigns track ids and estimates keypoints independently. This is also true of the competing methods with MOTA scores closest to ours. In light of this, we use the same keypoint estimations to compare Pose Entailment to competing tracking methods in 4.2. This makes IDSW the only component of the MOTA metric that changes, and we calculate $\%IDSW^i = \sum_t IDSW^i_t / \sum_t GT^i_t$. In 4.3, we compare our estimation method to others without evaluating tracking. Finally, in 4.4, we compare our entire tracking pipeline to other pipelines.

4.2. Improving Tracking with Pose Entailment

We compare with the optical flow tracking method [54] and the Graph Convolutional Network (GCN) [35], as shown in Figure 4. We do not compare with IoU because GCN and optical flow [35, 54] have been shown to outperform it, nor do we compare to the network from [40] because it is trained in an end-to-end fashion. We follow [54] for Optical Flow and use the pre-trained GCN provided by [35]. IDSW is calculated with three sets of keypoints. Regardless of the keypoint AP, we find that KeyTrack's Pose Entailment maintains a consistent improvement over the other methods. We incur approximately half as many ID switches as the GCN and 30% fewer than Optical Flow.

Our improvement over the GCN stems from the fact that it relies only on keypoint spatial locations. By using additional information beyond the spatial location of each keypoint, our model can make better inferences about the temporal relationship of poses. The optical flow CNNs are not specific to pose tracking and require manual tuning. For example, to scale the CNN's raw output, which is normalized from -1 to 1, to pixel flow offsets, a universal constant, given by the author of the original optical flow network (not [54]), must be applied. However, we found that this constant required adjustment. In contrast, our learned method requires no tuning during inference.

4.3. Improving Detection with TOKS

Detection Method               AP: Head   Shou   Elb    Wri    Hip    Knee   Ankl   Total
GT                                 90.2   91.4   88.7   83.6   81.4   86.1   83.7   86.7
Det.                               68.8   72.8   73.1   68.4   68.0   72.4   69.8   70.4
Det. + Box Prop.                   79.3   82.0   80.8   75.6   72.4   76.5   72.4   77.1
Det. + [email protected]               83.6   86.6   84.9   78.9   76.4   80.2   76.2   81.1
Det. + [email protected] (ours)        84.1   87.2   85.3   79.2   77.1   80.6   76.5   81.6
Det. + [email protected]               83.9   87.2   85.2   79.1   77.1   80.7   76.4   81.5

Table 2. Per-joint AP when the pose estimator is conditioned on different boxes. GT indicates ground truth boxes are used, and serves as an upper bound for accuracy. Det. indicates a detector was used to estimate boxes. @OKS* is the OKS threshold used.

Table 2 shows that TOKS offers a greater improvement in keypoint detection quality than the other methods. In the absence of bounding box improvement, the AP performance is 6.6% lower, highlighting the issue of false negatives. The further improvement from TOKS emphasizes the usefulness of estimating every pose. By using NMS, bounding box propagation methods miss the opportunity to use the confidence scores of the keypoints, which leads to better pose selection.

4.4. Tracking Pipeline Comparison to the SOTA

Now that we have analyzed the benefits of Pose Entailment and TOKS, we put them together and compare to other approaches. Figure 5 shows that we achieve the highest MOTA score. We improve over the original HRNet paper by 3.3 MOTA points on the Test set. [25], nearest our score on the 2018 Validation set, is much further away on the 2017 Test set. Additionally, our FPS is improved over all methods with similar MOTA scores, with many methods being offline due to their use of ensembles. (Frames per second (FPS) is calculated by dividing the number of frames in the dataset by the runtime of the approach.) Moreover, our method outperforms all others in terms of AP, showing the benefits of TOKS. APT is also reported, which is the AP score after tracking post-processing has been applied. This post-processing is beneficial to the MOTA score, but lowers AP. See section A.3 for more details on this post-processing. As we have the highest AP but not the highest APT, it appears the effect of tracking post-processing varies from paper to paper. Only APT is given on the test set because each paper is given 4 submissions, so these are used to optimize MOTA rather than AP.
PoseTrack 2018 ECCV Challenge Val Set
No.  Method                      Extra Data   APT    AP     FPS   MOTA
1.   KeyTrack (ours)             ✗            74.3   81.6   1.0   66.6
2.   MIPAL [25]                  ✗            74.6   -      -     65.7
3.   LightTrack (offline) [35]   ✗            71.2   77.3   E     64.9
4.   LightTrack (online) [35]    ✗            72.4   77.2   0.7   64.6
5.   Miracle [58]                ✓            -      80.9   E     64.0
6.   OpenSVAI [36]               ✗            69.7   76.3   -     62.4
7.   STAF [38]                   ✓            70.4   -      3     60.9
8.   MDPN [20]                   ✓            71.7   75.0   E     50.6

PoseTrack 2017 Test Set Leaderboard
No.  Method            Extra Data   APT    FPS   MOTA
1.   KeyTrack (ours)   ✗            74.0   1.0   61.2
2.   POINet [40]       ✗            72.5   -     58.4
3.   LightTrack [35]   ✗            66.7   E     58.0
4.   HRNet [45]        ✗            75.0   0.2   57.9
5.   FlowTrack [54]    ✗            74.6   0.2   57.8
6.   MIPAL [25]        ✗            68.8   -     54.5
7.   STAF [38]         ✓            70.3   2     53.8
8.   JointFlow [15]    ✗            63.6   0.2   53.1

Figure 5. Top scores on the PoseTrack leaderboards. E indicates an ensemble of detectors is used, which results in the method being offline. A check indicates external training data is used beyond COCO and PoseTrack. A "-" indicates the information has not been made publicly available. FPS calculations for JointFlow and FlowTrack are taken from [59]. HRNet FPS is approximated from FlowTrack since the methods are very similar. The AP column has the best AP score. APT is the AP score after tracking post-processing.

Figure 6. Qualitative results of KeyTrack on the PoseTrack'17 Test Set. Additional qualitative results are in the supplement.

Efficiency: Our tracking approach is efficient, not reliant on optical flow or RGB data. When processing an image at our optimal resolution, 24x18, we reduce the GFLOPS required by optical flow, which processes images at full size, from 52.7 to 0.1. [35]'s GCN does not capture higher-order interactions over keypoints and can be more efficient than our network with local convolutions; however, this translates to only a ~1 ms improvement in GPU runtime. In fact, our tracking pipeline demonstrates a 30% improvement in end-to-end runtime over [35], as shown in 4.4. We have the fastest FPS of the top-down approaches. Also, we do not rely on optical flow to improve bounding box propagation as [54, 45] do; instead we use TOKS. This contributes to our 5x FPS improvement over [54, 45]. Further details on the parameters and FLOPS of the GCN, the Optical Flow Network, and our Transformer Matching Network are in Table 6 of the Supplement.

Abs. Position   Type   Segment   Rel. Position   Match % Accuracy
✓               ✓      ✗         ✗               72.6
✓               ✗      ✓         ✗               90.0
✓               ✓      ✓         ✗               93.2 (ours)
✗               ✓      ✓         ✓               91.3
✓               ✓      ✓         ✓               92.0

Table 3. Match accuracies for various embedding schemes.

5. Analysis

5.1. Tracking Pipeline

Varying Tokenization Schemes and Transformer Hyper-parameters: We examine the benefits of each embedding. As evident in Table 3, Segment embeddings are crucial because they enable the network to distinguish between the poses being matched. Type embeddings give the network information about the orientation of a pose and help it interpret keypoints which are in close spatial proximity, i.e. keypoints that have the same or similar Position embedding. We also train a model that uses the relative keypoint distance from the pose center rather than the absolute distance of the keypoint in the entire image. We find that match accuracy deteriorates with this embedding. This is likely because many people perform the same activity, such as running, in the PoseTrack dataset, leading to them having nearly identical relative pose positions. We vary the number of transformer blocks, the hidden size in the transformer block, and the number of heads in Table 7. Decreasing the number of transformer blocks, the hidden size, and attention heads hurts performance.

Number of Timesteps and Other Factors: We find that reducing the number of timesteps adversely affects the MOTA score. It drops up to 0.3 points when using only a single timestep because we are less robust to detection errors. Also, in place of our greedy algorithm, we experimented with the Hungarian algorithm used in [19]. This algorithm is effective with ground truth information, but is not accurate when using detected poses.
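For reference, the Hungarian-style alternative can be sketched as follows: match scores for all candidate pairs are collected into a matrix and solved jointly with SciPy's linear_sum_assignment, instead of taking the per-pose maximum greedily. The score floor and the data layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_assign(score_matrix, score_floor=0.5):
    """score_matrix[i, j]: match score between current pose i and previous track j.
    Returns {current_pose_index: track_index}; pairs scoring below `score_floor`
    are left unassigned (the floor value is an illustrative choice)."""
    # linear_sum_assignment minimizes total cost, so negate the match scores
    rows, cols = linear_sum_assignment(-score_matrix)
    return {i: j for i, j in zip(rows, cols) if score_matrix[i, j] >= score_floor}

# toy example: two current poses, two previous tracks
scores = np.array([[0.9, 0.2],
                   [0.8, 0.7]])
print(hungarian_assign(scores))  # {0: 0, 1: 1}
```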
Num Tx   Hidden Size   Int. Size   Num Heads   Parameters (M)   % IDSW
2        128           512         4           0.40             1.0
4        128           512         4           0.43             0.8
6        128           512         4           1.26             1.1
4        64            256         4           0.23             0.9
4        128           512         4           0.43             0.8
4        256           1024        4           3.31             1.1
4        128           128         4           0.43             0.8
4        128           512         4           0.86             0.8
4        128           128         2           0.43             0.9
4        128           128         4           0.43             0.8
4        128           128         6           0.43             0.8

Figure 7. Left: Transformer network hyper-parameters are varied. Right: A plot of IDSW rate vs. image resolution. The table on the left shows the input to each method; the conv+visual input is blurry because images are downsampled.
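For orientation, the best-performing configuration in the table above (4 transformer blocks, hidden size 128, intermediate size 512, 4 attention heads) can be approximated with stock PyTorch modules as below. This is an illustrative stand-in rather than the authors' implementation, and the parameter count of such a stack is not expected to match the table exactly.

```python
import torch.nn as nn

def build_matching_encoder(hidden=128, intermediate=512, heads=4, blocks=4):
    """Stack of standard transformer encoder blocks for the 2|K|-token pose pair,
    with a pooled-first-token binary head in the spirit of the paper's Pooler."""
    layer = nn.TransformerEncoderLayer(
        d_model=hidden,
        nhead=heads,
        dim_feedforward=intermediate,
        batch_first=True,
    )
    encoder = nn.TransformerEncoder(layer, num_layers=blocks)
    classifier = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))
    return encoder, classifier
```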

5.2. Comparing Self-Attention to Convolutions

We compare transformers and CNNs by replacing our Transformer Matching Network with two convolution-based methods. One takes visual features from bounding box pose pairs as input, while the other takes only keypoints as input, where each unique keypoint is colored via a linear interpolation, a visual version of our Type tokens. Both approaches use identical CNNs, sharing an architecture inspired by VGG [43], and have approximately 4x more parameters than our transformer-based model because this was required for stable training. See A.4 of the Supplement for details.

Transformers outperform CNNs for the tracking task, as shown in Figure 7. However, we find two areas where CNNs can be competitive. First, at higher resolutions, transformers often need a large number of parameters to match the CNN's performance. In NLP, when using large vocabularies, a similar behavior is observed, where transformers need multiple layers to achieve good performance. Second, we also find that convolutions optimize more quickly than the transformers, reaching their lowest number of ID switches within the first 2 epochs of training. Intuitively, CNNs are more easily able to take advantage of spatial proximity. The transformers receive spatial information via the position embeddings, which are 1D linear projections of 2D locations. This can be improved by using positional embedding schemes that better preserve spatial information [18].

In summary, CNNs are accurate at high resolutions given their useful properties, such as translation invariance and location invariance. However, there is an extra computational cost to using them. The extra information, beyond the spatial location of keypoints, included in our keypoint embeddings, coupled with the transformer's ability to model higher-order interactions, allows it to function surprisingly well at very low resolutions. Thus, the advantage of CNNs is diminished, and our transformer-based network outperforms them in the low-resolution case.

Figure 8. Attention heatmaps from two of our network's attention heads are shown. These are the 0th and 3rd heads from our final transformer. The two pairs above the dotted line are a matching pair, while the pair below the dotted line are not (and are also from separate videos). t is the frame timestep.

5.3. Visualizing Attention Heatmaps

We visualize our network's attention heatmaps in Fig. 8. When our network classifies a pair as non-matching, its attention is heavily placed on one of the poses over the other. Also, we find it interesting that one of our attention heads primarily places its attention on keypoints near the person's head. This specialization suggests different attention heads are attuned to specific keypoint motion cues.

6. Conclusion

In summary, we present an efficient multi-person pose tracking method. Our proposed Pose Entailment method achieves SOTA performance on PoseTrack datasets without using RGB information in the tracking step. KeyTrack also benefits from improved keypoint estimates using TOKS, which outperforms bounding box propagation methods. Finally, we demonstrate how to tokenize and embed human pose information in the transformer architecture, which has applications to tasks such as pose-based action recognition.
References

[1] Posetrack leaderboard, 2017 test set, 2017.
[2] Posetrack challenge - ECCV 2018, 2018.
[3] Abdulrahman Alarifi, AbdulMalik Al-Salman, Mansour Alsaleh, Ahmad Alnafessah, Suheer Al-Hadhrami, Mai A Al-Ammar, and Hend S Al-Khalifa. Ultra wideband indoor positioning technologies: Analysis and recent advances. Sensors, 16(5):707, 2016.
[4] Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, and Bernt Schiele. PoseTrack: A benchmark for human pose estimation and tracking. In CVPR, 2018.
[5] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1014-1021. IEEE, 2009.
[6] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2D human pose estimation - MPII human pose dataset. In CVPR, 2014.
[7] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. Journal on Image and Video Processing, 2008:1, 2008.
[8] Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, and Lorenzo Torresani. Learning temporal pose estimation from sparsely-labeled videos. arXiv preprint arXiv:1906.04016, 2019.
[9] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
[10] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7103-7112, 2018.
[11] Bo Dai, Yuqi Zhang, and Dahua Lin. Detecting visual relationships with deep relational networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[12] Matthias Dantone, Juergen Gall, Christian Leistner, and Luc Van Gool. Human pose estimation using body parts dependent joint regressors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3041-3048, 2013.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[15] Andreas Doering, Umar Iqbal, and Juergen Gall. JointFlow: Temporal flow fields for multi person tracking, 2018.
[16] Pedro F Felzenszwalb and Daniel P Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 61(1):55-79, 2005.
[17] Mihai Fieraru, Anna Khoreva, Leonid Pishchulin, and Bernt Schiele. Learning to refine human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 205-214, 2018.
[18] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 244-253, 2019.
[19] Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran. Detect-and-track: Efficient pose estimation in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 350-359, 2018.
[20] Hengkai Guo, Tang Tang, Guozhong Luo, Riwei Chen, Yongchen Lu, and Linfu Wen. Multi-domain pose network for multi-person pose estimation and tracking. Computer Vision - ECCV 2018 Workshops, pages 209-216, 2019.
[21] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961-2969, 2017.
[22] Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[23] Hao Huang, Luowei Zhou, Wei Zhang, Jason J Corso, and Chenliang Xu. Dynamic graph modules for modeling object-object interactions in activity recognition: Supplementary material. BMVC, 2019.
[24] Shaoli Huang, Mingming Gong, and Dacheng Tao. A coarse-fine network for keypoint localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 3028-3037, 2017.
[25] Jihye Hwang, Jieun Lee, Sungheon Park, and Nojun Kwak. Pose estimator and tracker using temporal flow maps for limbs, 2019.
[26] Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu Tang, Evgeny Levinkov, Bjoern Andres, and Bernt Schiele. ArtTrack: Articulated multi-person tracking in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6457-6465, 2017.
[27] Umar Iqbal, Anton Milan, and Juergen Gall. PoseTrack: Joint multi-person pose estimation and tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2011-2020, 2017.
[28] Sheng Jin, Wentao Liu, Wanli Ouyang, and Chen Qian. Multi-person articulated tracking with spatial and temporal embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5664-5673, 2019.
[29] Sheng Jin, Xujie Ma, Zhipeng Han, Yue Wu, Wei Yang, Wentao Liu, Chen Qian, and Wanli Ouyang. Towards multi-person pose tracking: Bottom-up and top-down methods. In ICCV PoseTrack Workshop, volume 2, page 7, 2017.
[30] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[31] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
[32] Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, and Hans Peter Graf. Attend and interact: Higher-order object interactions for video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6790-6800, 2018.
[33] Anton Milan, Laura Leal-Taixé, Ian D. Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. CoRR, abs/1603.00831, 2016.
[34] Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. PoseFix: Model-agnostic general human pose refinement network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7773-7781, 2019.
[35] Guanghan Ning and Heng Huang. LightTrack: A generic framework for online top-down human pose tracking. arXiv preprint arXiv:1905.02822, 2019.
[36] Guanghan Ning, Ping Liu, Xiaochuan Fan, and Chi Zhang. A top-down approach to articulated human pose estimation and tracking. Computer Vision - ECCV 2018 Workshops, pages 227-234, 2019.
[37] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4903-4911, 2017.
[38] Yaadhav Raaj, Haroon Idrees, Gines Hidalgo, and Yaser Sheikh. Efficient online multi-person 2D pose tracking with recurrent spatio-temporal affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4620-4628, 2019.
[39] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019.
[40] Weijian Ruan, Wu Liu, Qian Bao, Jun Chen, Yuhao Cheng, and Tao Mei. POINet: Pose-guided ovonic insight network for multi-person pose tracking. In Proceedings of the 27th ACM International Conference on Multimedia, MM '19, pages 284-292, New York, NY, USA, 2019. ACM.
[41] Adam Santoro, David Raposo, David GT Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, 2017.
[42] Leonid Sigal and Michael J Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 2041-2048. IEEE, 2006.
[43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.
[44] Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743, 2019.
[45] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. CoRR, abs/1902.09212, 2019.
[46] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.
[47] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648-656, 2015.
[48] Alexander Toshev and Christian Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1653-1660, 2014.
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017.
[50] Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2566-2576, 2019.
[51] Yang Wang and Greg Mori. Multiple tree models for occlusion and spatial constraints in human pose estimation. In European Conference on Computer Vision, pages 710-724. Springer, 2008.
[52] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4724-4732, 2016.
[53] Fangting Xia, Peng Wang, Xianjie Chen, and Alan L Yuille. Joint multi-person pose estimation and semantic part segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6769-6778, 2017.
[54] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision (ECCV), 2018.
[55] Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, and Cewu Lu. Pose Flow: Efficient online pose tracking. arXiv preprint arXiv:1802.00977, 2018.
[56] Jiarui Xu, Yue Cao, Zheng Zhang, and Han Hu. Spatial-temporal relation networks for multi-object tracking. arXiv preprint arXiv:1904.11489, 2019.
[57] Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Learning feature pyramids for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1281-1290, 2017.
[58] Dongdong Yu, Kai Su, Jia Sun, and Changhu Wang. Multi-person pose estimation for pose tracking with enhanced cascaded pyramid network. In European Conference on Computer Vision, pages 221-226. Springer, 2018.
[59] Jiabin Zhang, Zheng Zhu, Wei Zou, Peng Li, Yanwei Li, Hu Su, and Guan Huang. FastPose: Towards real-time pose estimation and tracking via scale-normalized multi-task networks, 2019.
[60] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and VQA. arXiv preprint arXiv:1909.11059, 2019.
[61] Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8739-8748, 2018.
