0% found this document useful (0 votes)

10 views

2103.02440v2

The document presents OpenPifPaf, a novel framework for real-time semantic keypoint detection and spatio-temporal association, enabling efficient pose estimation and tracking. It introduces a new architecture utilizing Composite Fields and a Temporal Composite Association Field (TCAF) to improve accuracy and speed in detecting and tracking multiple keypoints across frames. The method demonstrates competitive performance on various datasets, generalizing well to different classes of keypoints, making it suitable for applications in urban mobility such as self-driving cars and delivery robots.

Uploaded by

afakharany9social

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views

2103.02440v2

Uploaded by

afakharany9social

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

1

OpenPifPaf:
Composite Fields for Semantic Keypoint Detection
and Spatio-Temporal Association
Sven Kreiss, Lorenzo Bertoni, Alexandre Alahi

Abstract—Many image-based perception tasks can be formu-

arXiv:2103.02440v2 [cs.CV] 21 Sep 2021

lated as detecting, associating and tracking semantic keypoints,

e.g., human body pose estimation and tracking. In this work,
we present a general framework that jointly detects and forms
spatio-temporal keypoint associations in a single stage, making
this the first real-time pose detection and tracking algorithm.
We present a generic neural network architecture that uses
Composite Fields to detect and construct a spatio-temporal pose
which is a single, connected graph whose nodes are the semantic
keypoints (e.g., a person’s body joints) in multiple frames. For
the temporal associations, we introduce the Temporal Composite
Association Field (TCAF) which requires an extended network
architecture and training method beyond previous Composite
Fields. Our experiments show competitive accuracy while being
an order of magnitude faster on multiple publicly available Fig. 1. A real-world scene from the perspective of a self-driving car.
datasets such as COCO, CrowdPose and the PoseTrack 2017 Schematically, all moving actors are detected with their poses and tracked
and 2018 datasets. We also show that our method generalizes to so that they can be consistently quantified over time. We place particular
any class of semantic keypoints such as car and animal parts to emphasis on understanding humans but also show generalizations to animals
provide a holistic perception framework that is well suited for and cars. Here, a car (tracked as 4) is running a red light while also swerving
urban mobility such as self-driving cars and delivery robots. to the right to avoid a woman (tracked as 2) who is walking her dog (tracked
as 3).
Index Terms—composite fields, pose estimation, pose tracking.

be viable for self-driving cars. Tracking human poses in real-

time will enable self-driving cars to develop a finer-grained
I. I NTRODUCTION understanding of pedestrian behavior and with that a better
The computer vision community has made tremendous conditioned reasoning for more natural driving.
progress in solving fine-grained perception tasks such as The problem is to estimate and track multiple human, car
human body joints detection and tracking [1], [2]. We can and animal poses in image sequences, see Figure 1. The major
cast these tasks as detecting, associating and tracking semantic challenges for tracking poses from the car perspective are
keypoints. Examples of semantic keypoints are “left shoul- (i) occlusions due to the viewing angle and (ii) prediction
ders”, “right knees” or the “left brake lights of vehicles” speed to be able to react to real-time changes in the environ-
as opposed to keypoints used in classical feature detectors ment. Our method must be fast enough to be viable for self-
that focus on the local geometry of the pixel intensities, driving cars and robust to real-world variations like lighting,
like “corners” and “edges”. However, the performance of weather and occlusions.
semantic keypoint tracking in live video sequences has been Although tracking has been studied extensively before hu-
limited in accuracy and high in computational complexity and man pose estimation [4], [5], [6], [7], a significant cornerstone
prevented applications to the transportation domain with real- that leverages poses are the works of Insafutdinov et al. [8]
time requirements like self-driving cars and last-mile delivery and Iqbal et al. [9] who pioneered multi-person pose tracking
robots. The majority of self-driving car accidents is caused for an arbitrary number of people in the wild. Both methods
by “robotic” driving where the self-driving car conducts an use graph matching to track independent, single-frame poses
allowed but unexpected stop and a human driver crashes into over time. To improve the matching for tracking, Doering
the self-driving car [3]. At their core, self-driving cars lack et al. [10] introduced temporal flow fields that improve the
social intelligence. They are blind to the body language of cost function for matching. However, these works treat pose
surrounding pedestrians when every person is only perceived tracking as a multi-stage process: infer single-frame poses –
as a bounding box. Current pose detection and tracking meth- which is itself a multi-stage process for top-down methods –
ods are neither fast enough nor robust enough to occlusions to and connect poses from frame to frame. This prohibits any
improvement to single-frame poses that could result from the
Sven Kreiss, Lorenzo Bertoni and Alexandre Alahi are at the VITA lab at temporal information available in tracking. Here, we address
EPFL, 1015 Lausanne, Switzerland, e-mail: [email protected] these challenges by introducing a new method that jointly
2

solves pose detection and tracking with Composite Fields. Doering et al. [10] were the first to introduce a method that
First, we review Composite Fields for single-image multi- is bottom-up in both the spatial and the temporal part. They
person pose estimation [11]. Second, we introduce a new employ Part Affinity Fields [16] for the single-frame poses in
method for pose tracking. While single-frame pose estimation a Siamese architecture. The temporal flow fields (TFF) feed
can be viewed as a pose completion task starting at a seed into an edge cost computation for bipartite graph matching
joint, we treat pose tracking as a pose completion task starting for tracking. The idea is extended in MIPAL [26] for tracking
with a pose from a previous frame and completing a spatio- limbs instead of joints and in STAF [27].
temporal pose, which is a single, connected graph that spans Early work on multi-person pose tracking started with [8],
space and time. The spatio-temporal pose consists of at least [9]. Recent work has shown excellent performance on the
two single-frame poses and additional connections across the PoseTrack 2018 dataset including the top-down method
frames. openSVAI [28] which decomposes the problem into three
The contributions of this paper are (i) a Temporal Composite independent stages of human candidate detection, single-
Association Field (TCAF) which we use to form a spatio- image human pose estimation and pose tracking. Similarly,
temporal pose and (ii) a greedy decoder to jointly detect and LightTrack [29] also builds a strong top-down pipeline with
track poses. To the best of our knowledge, this method is interchangeable and independent modules. Miracle [30] uses
the first single-stage, bottom-up pose detection and tracking a strong single-image pose estimator with a cascaded pyramid
method. We outperform all previous methods in accuracy and network together with an IOU tracker. HRNet for human
speed on the CrowdPose dataset [12] with its particularly pose estimation [20] leverages a multi-resolution backbone to
crowded images. We perform on par with the state-of-the-art produce high resolution feature maps that are context aware
bottom-up method for single-image human pose estimation on via HRNet’s multi-scale fusion. In MSRA/FlowTrack [19],
the COCO [2] keypoint task in precision and are an order optical flow is used to improve top-down tracking of bounding
of magnitude faster in speed. Our model performs on par boxes for tracking of human poses. Pose-Guided Grouping
with the state-of-the-art method for human pose tracking on (PGG) [31] proposes a part association method based on
PoseTrack 2017 and 2018 [13] while simultaneously being separate spatial and temporal embeddings. KeyTrack [32]
an order of magnitude faster during prediction. We also uses pose tokenization and a transformer network to associate
show that our method generalizes to car and animal poses poses.
which demonstrates its suitability for a holistic perception
framework. Our method is implemented as an open source C. Beyond Humans
library, referred to as OpenPifPaf 1 . While many state-of-the-art methods focused on human
body pose detection and tracking, the research community has
II. R ELATED W ORK recently studied their performance on other classes such as
A. Pose Estimation animals and cars. Pose estimation research for animals and
State-of-the-art methods for pose estimation are based on cars has to deal with additional challenges: limited labeled
Convolutional Neural Networks [14], [15], [16], [17], [18], data [33] and large number of self-occlusions [34].
[19], [20], [21], [22], [23], [24]. All approaches for human For animals, datasets are usually small and include limited
pose estimation can be grouped into bottom-up and top-down animal species [35], [33], [36], [37], [38]. To overcome this
methods. The former estimates each body joint first and then issue, DeepLabCut [39] and WS-CDA [33] have developed
groups them to form a unique pose. The latter runs a person transfer learning techniques from humans to animals. Mu et
detector first and estimates body joints within the detected al. [40] have generated a synthetic dataset from CAD animal
bounding boxes. Bottom-up methods were pioneered, e.g., by models and proposed a technique to bridge the real-synthetic
Pishchulin et al. with DeepCut [25]. In their work, the part domain gap. Another line of work has extended the human
association is solved with an integer linear program leading SMPL model [41] to animals to learn simultaneously pose
to processing times for one image of the order of hours. and shape of endangered animals [42], [43], [44].
Newer methods use greedy decoders in combination with For cars, self-occlusions between keypoints are inevitable.
additional tools to reduce prediction time as in Part Affinity A few methods improve performances by estimating 2D and
Fields [16], Associative Embedding [17], PersonLab [18] 3D keypoints of vehicles together. Occlusion-net [34] uses a
and multi-resolution networks with associate embedding [24]. 3D graph network with self-supervision to predict 2D and
PifPaf [11] introduced composite fields for pose estimation 3D keypoints of vehicles using the CarFusion dataset [45],
that produces a more precise association between joints than while GSNet [46] predicts 6DoF car pose and reconstructs
OpenPose’s Part Affinity Fields [16] and PersonLab’s mid- dense 3D shape simultaneously. Without 3D information, the
range fields [18]. In the next section, we will review composite popular OpenPose [47] shows qualitative results for vehicles
fields and show that they generalize to tracking tasks. and Simple Baseline [48] extends a top-down pose estimator
for cars on a custom dataset based on Pascal3D+ [49].
B. Pose Tracking
III. C OMPOSITE F IELDS
Tracking algorithms can be grouped into top-down versus
Our method relies on the Composite Fields formalism to
bottom-up approaches for the pose part and the tracking part.
jointly detect and track semantic keypoints. Hereafter, we
1 https://github.com/vita-epfl/openpifpaf posetrack briefly present them.
3

(a) (b) (c)

Fig. 2. Visualizing the components of the CIF for the “left shoulder” keypoint on a small image crop. The confidence map is shown in (2a). The vector field
with joint-scale estimates is shown in (2b). Only locations with confidence > 0.5 are drawn. The fused confidence, vector and scale components according
to Equation 1 are shown in (2c).

(a) (b)
Fig. 3. Visualizing the components of the CAF that associates left shoulder with left hip. This is one of the 18 CAF. Every location of the feature map is
the origin of two vectors which point to the shoulders and hips to associate. The confidence of associations ac is shown at their origin in (3a) and the vector
components for ac greater than 0.5 are shown in (3b).

components vx , vy is {s, vx , vy }ij . This is equivalent to “over-

laying” a confidence map with a vector field if the ground
truth is aligned. This equivalence is trivial in this example
but becomes more subtle when we discuss association fields
below.
b) Composite Intensity Fields (CIF): The Composite
(a) (b) (c) Intensity Fields (CIF) characterize the intensity of semantic
Fig. 4. Common association fields between two joints. Joints are visualized keypoints. The composite structure is based on [53] with the
as gray circles. Part Affinity Fields (a) as used in OpenPose [16] are unit
vectors indicating a direction towards the next joint. Mid-range fields (b) as
extension of a scale σ to characterize the keypoint size. This is
used in PersonLab [18] are vectors originating in the vicinity of a source joint identical to the part intensity field in [11]. We use the notation
and point to the target joint. Our Composite Association Field (c) regresses pij ij
J = {c, x, y, b, σ}J where J is a particular body joint type,
both source and target points and additionally predicts their joint size which
are visualized with blue squares.
c is the confidence, x and y are regressed coordinates, b is the
uncertainty in the location and σ is the size of the joint.
Figure 2 shows the components of a CIF field and a
a) Field Notation: Fields are functions over locations high resolution accumulation of the predicted intensity. The
(e.g., feature map cells) and their outputs are primitives like field is coarse with a stride of 16 with respect to the input
scalars or composites. Composite Fields as introduced in [11] image but the accumulated intensity is at high resolution. The
jointly predict multiple variables of interest, for example, the high resolution confidence map f (v, w) is a convolution of
confidence, precise location and size of a semantic keypoint an unnormalized Gaussian kernel N with width σ over the
(e.g., body joint). regressed targets from the Composite Intensity Field x and y
We will enumerate the spatial output coordinates of the weighted by its confidence c:
neural network with i, j and reserve x, y for real-valued X ij
coordinates in the input image. A field over (i, j) is denoted fJ (v, w) = cJ N (v, w|xij ij ij
J , yJ , σJ ) (1)
ij
with fij and can have scalar, vector or composite values.
For example, the composite field of scalars s and 2D vector where v and w are real-valued coordinates in the image. This
4

Fig. 5. Model architecture. The input is an image batch of size (H, W ) with three color channels, indicated by “x3”. During joint training on multiple
datasets, the datasets produce image pairs (black arrows for current image at t0 and red arrows for image at t−1 ) whereas during evaluation they produce
single images in a sequence. The neural network based encoder produces composite fields for M joints and N connections. An operation with stride two is
indicated by “//2”. The shared backbone is a ResNet [50] or ShuffleNetV2 [51] without max-pooling. The Feature Cache is only used during evaluation and
injects for every image the previous feature map into the batch. We use a single 1 × 1 convolution in each head network. The TCAF head networks have a
shared pre-processing step consisting of a feature reduction to 512 with a 1 × 1 convolution followed by ReLU, a concatenation of the two feature maps and
another 1 × 1 convolution with ReLU activation. For optional spatial upsampling, we append a sub-pixel convolution layer [52] to each head network. The
decoder converts a set of composite fields into pose estimates. Each semantic keypoint is represented by a confidence score, a real-valued (x, y) coordinate
pair and a size estimate.

accumulation incorporates information of the confidence c, the associations – something that Part Affinity Fields [16] cannot
precisely regressed spatial location (x, y) and the predicted do – which becomes particularly important when we introduce
joint size σ. This map fJ is used to seed the pose decoder our extension for tracking.
and to rescore predicted CAF associations.
c) Composite Association Fields (CAF): Efficiently
forming associations is the core challenge for tracking multiple IV. M ETHOD
poses in a video sequence. The most difficult cases are
crowded scenes and camera angles where people occlude other We aim to present a method that can detect, associate
people – as is the case in the self-driving car perspective where and track semantic keypoints in videos efficiently. We place
pedestrians occlude other pedestrians. Top-down methods first particular emphasis on urban and crowded scenes that are
estimate bounding boxes and then do single-person pose difficult for autonomous vehicles. Many previous methods
estimation per bounding box. This assumes non-overlapping struggle when object bounding boxes overlap. In bird-eye
bounding boxes which is not given in our scenario. Therefore, views from drones or security cameras, bounding boxes are
we focus on bottom-up methods. more separated than in a car driver’s perspective. Here, top-
In [11], we introduced Part Association Fields to connect down methods struggle. Previous bottom-up methods have
joint locations together into poses. Here, we extend this field been trailing top down methods in accuracy without improving
with joint-scale components and call it Composite Association on performance either. Our bottom-up method is efficient,
Field (CAF) to distinguish it better from Part Affinity Fields employs a stable field representation and has high accuracy
introduced in [16]. A graphical review of association fields is and performance that even surpasses top-down methods.
shown in Figure 4 and shows that our CAF expresses the most Figure 5 presents our model architecture. It is a shared
detail about an association. ResNet [50] or ShuffleNetV2 [51] base network without
CAFs predict a confidence, two vectors to the two parts max-pooling. The head networks are shallow and not shared
this association is connecting, two spreads b for the spatial between datasets. In our examples, each dataset has a head
precisions of the regressions (details in Section IV-A) and network for joint intensities (Composite Intensity Field – CIF)
two joint sizes σ. CAFs are represented with aij and a head network for associations (Composite Association
J1 ↔J2 =
Field – CAF). Beyond CIF and CAF, additional head networks
{c, x1 , y1 , x2 , y2 , b1 , b2 , σ1 , σ2 }ij
J1 ↔J2 where J 1 ↔ J2 is the
can be added. In Section IV-B, we introduce the new Temporal
association between body joints J1 and J2 . Predicted asso-
Composite Association Field (TCAF) which is predicted by an
ciations between left shoulders and left hips are shown for
additional head network to facilitate pose tracking.
an example image in Figure 3. In our representation of an
We will introduce a tracking method that is a direct ex-
association, physically meaningful quantities are regressed to
tension of single-image pose estimation. Therefore, we first
continuous variables and do not suffer from the discreteness
introduce our method for single-image pose estimation with
of the feature map. In addition, it is important to represent
particular emphasis on details that will be relevant for pose
associations between two joints that are at the same pixel
tracking.
location. Our representation is stable for these zero-distance
5

than five. The CAF loss has the same structure but with two
localization components (3) and two scale components (4).
b) Self-Hidden Keypoint Suppression: The COCO eval-
uation metric treats visible and hidden keypoints in the same
manner. As in [11], we include hidden keypoints in our
training. However, when a visible and a hidden keypoint
appear close together, we remove the hidden keypoint from the
ground truth annotation so that this keypoint is not included
in associations. In Figure 6, we show the effect of excluding
these self-hidden keypoints from training and observe better
pose reconstruction when a keypoint hides another keypoint
of the same type.
Fig. 6. Effect of self-hidden keypoint suppression during training. The left c) Greedy Decoder with Frontier: The composite fields
image is without and the right image is with self-hidden keypoint suppression. are converted into sets of pose estimates with the greedy
The left hips of both soccer players collide in pixel space. decoder introduced in [11] and reviewed here. The CIF field
and its high-resolution accumulation f (x, y) defined in equa-
tion 1 provide seed locations. Previously, new associations
A. Single-Image Pose Estimation were formed starting at the joint that has currently the highest
a) Loss Functions for Composite Fields: Human pose score without taking the CAF confidence of the association
estimation algorithms tend to struggle with the diversity of into account. Here, we introduce a frontier which is a priority
scales that a human pose can have in an image. While a queue of possible next associations. The frontier is ordered by
localization error for the joint of a large person can be minor, the possible future joint scores which are a function of the
that same absolute error might be a major mistake for a small previous joint score and the best CAF association:
person. Our loss is the logarithm of the probability that all
||~x − (x1 , y1 )||2

ij
components are “well” predicted, i.e., it is the sum of the log- max s(aJ1 ↔J2 , ~x) = c exp − fJ2 (x2 , y2 )
ij σ1
probabilities for the individual components. Each component (6)
follows standard loss prescriptions. We use binary cross en- where ~x is the source joint location, aij J1 ↔J2 =
tropy (BCE) for classification with a Focal loss modification (c, x1 , y1 , x2 , y2 , σ1 , σ2 ) is the CAF field with implied
w [54]. To regress locations in the image, we use the Laplace sub-/superscripts on the components and fJ2 is the high
loss [55] which is an L1 -type loss that is attenuated by a resolution confidence map of the target joint J2 . An
predicted spread b̂ in the location. To regress additional scale association is rejected when it fails reverse matching. To
components (keypoint sizes), we use a Laplace loss with a reduce jitter, we not only use the best CAF association in
fixed spread bσ = 3. The CIF loss function is: the above equation but a weighted mixture of the best two
X associations; similar to blended connections in [56]. Only
LCIF = w(c, ĉ)BCE(c, ĉ) (2) when all possible associations are added to the frontier, the
mc
X1 connection is made to the highest priority in the frontier. This
+ L2 (v, v̂, bmin ) + log b̂ (3) algorithm is fast and greedy. Once a connection to a new
mv b̂ joint has been made, this decision is final.
X 1 ŝ d) Instance Score and Non-Maximum Suppression
+ 1− (4) (NMS): Once all poses are reconstructed, we apply NMS.
m
bs s
s
Poses are first sorted by their instance score which is the
with its three parts for confidence (2), localization (3) and weighted mean of the keypoint scores where the three highest
scale (4) and where: keypoint scores are weighted three times higher. We run NMS
q at the keypoint level as in [11], [18]. The suppression radius
L2 (v, v̂, bmin ) = (v1 − v̂1 )2 + (v2 − v̂2 )2 + b2min . (5) is dynamic and based on the predicted joint size. We do not
refine predictions.
The sums are over masked feature cells mc , mv and mσ with e) Denser Pose Skeletons: Figure 7 gives an overview
i, j, J implied. The mask for confidence mc is almost the entire of the pose skeletons that are used in this paper. In particular,
image apart from regions annotated as “crowd regions” [2]. Figure 7b shows a modification of the standard COCO pose [2]
The masks for localization mv and for scale mσ are only with additional associations. These denser associations are
active in a 4 × 4 window around the ground truth keypoint. redundancies in case of occlusions. The additional associations
Per feature map cell, there is a ground truth confidence c and are longer-range and therefore harder to predict. The frontier in
its predicted counterpart ĉ. The predicted location v̂ = (v̂1 , v̂2 ) our greedy decoder takes this difficulty into account and auto-
is optimized with a Laplace loss with a predicted spread matically prefers easier, confident associations when available.
b̂ for heteroscedastic aleatoric uncertainty [55] with respect Qualitatively, the advantage of dense associations is shown
to the ground truth location v. A bmin = 1px is added to in Figure 8. With the standard COCO skeleton, the single
prevent exploding losses when the spread becomes too small. person’s pose skeleton would be divided into two disconnected
For stability, we clip the BCE loss when it becomes larger parts (left image) as indicated by the two white bounding
6

Fig. 8. Left: A sparse pose cannot connect the right arm to the facial keypoints
(a) (b) (c) (d) leading to the detection of two separate person instances highlighted by the
two white bounding boxes. Right: An additional dense connection between
Fig. 7. A COCO person pose [2] is shown in (a). Additional denser the nose and right shoulder leads to a correctly identified single pose.
connections are shown in lighter colors in (b). The additional connections
provide redundancies in case of occlusions. A pose skeleton as used in
Posetrack with temporal connections is shown in (c). An example of a tracked
pose is shown in (d). The first frame is captured with the right leg (blue) c) From Spatio-Temporal Poses to Tracks: Spatio-
in front and the second frame one step later. For clarity, only connections temporal poses create temporal associations in pairs of images.
that were used to decode the pose are shown and therefore only the temporal
connection that is connecting the right ankle from the past frame to the current
We now introduce our book-keeping method to go from pairs
frame is visible. of images to image sequences. During evaluation and for a
new frame t0 , the decoder creates new tracking poses from
existing tracks (poses in the previous frame t−1 ) or from
boxes. With the additional denser associations, a single pose single-image seeds in the current frame t0 . These partial poses
is formed (right image). are then completed using the same greedy frontier decoder
described for single images. Once all spatio-temporal poses are
B. Pose Tracking complete, the t0 joints are extracted into single-frame poses.
In the previous section we introduced our method for Every single-frame pose is already tagged with an existing
bottom-up pose estimation in single images. We now general- track-id if the spatio-temporal pose was generated from an
ize that method to tracking poses in videos with associations existing track or a new track-id if the spatio-temporal pose
between images in the same bottom-up fashion. Our unified originated from a new seed in the current frame. The single-
approach forms both spatial and temporal associations simulta- frame poses are then filtered with soft NMS [18] and then
neously. This even leads to improved single-image poses from either added to existing tracks or they become the first poses
the additional temporal information. of new tracks.
a) Temporal Composite Association Field (TCAF): Dur- Our method is bottom-up in both pose estimation and
ing training, tracking data is fed into the base network as tracking and estimates temporal and spatial connections within
image pairs that are concatenated in the batch dimension, i.e., a single stage. Most existing work – even other bottom-up
a batched input tensor of eight image pairs has the same shape tracking methods [10], [26] – employ a two stage process
as 16 individual images. where, first, spatial connections are estimated and, second,
During prediction, the backbone processes one image at a temporal connections are made.
time and each image only once. The resulting feature map
is then concatenated with the previous feature map from the V. E XPERIMENTS
“Feature Cache” (see Figure 5). While there is still duplicate Self-driving cars must perceive and predict pedestrians and
computation in the head networks, their computational com- other traffic participants robustly. One of the most challenging
plexity is small. scenarios are crowded places. We will first show experiments
To form associations in image sequences, we introduce the on single-image human pose estimation in CrowdPose [12]
Temporal Composite Association Field (TCAF). Its output which contains particularly challenging scenarios and on the
structure is identical to a CAF field, but its input is based standardized and competitive COCO [2] person keypoint
on pairs of feature maps that were created independently. To benchmark. Then we will show results for pose tracking in
jointly process information from both feature maps, the TCAF videos on the PoseTrack 2017 [9] and 2018 [13] datasets. We
head contains a preprocessing step of a 1×1 input convolution have conducted extensive experiments to show the benefit of
to reduce the feature size to 512 with ReLU non-linearity, a unified bottom-up pose estimation and tracking method with
a concatenation of these two feature maps to 1024 features, spatio-temporal poses. To demonstrate the universality of our
a 1 × 1 convolution with ReLU to process the two images approach, we apply our method also to poses of cars and poses
jointly and a final 1×1 convolution to produce all components of animals.
necessary for a composite association field.
b) Spatio-Temporal Poses: Figure 7c shows a schematic
of a person pose (17 joints and 18 associations) with additional A. Datasets
temporal connections to all joints of the same kind in the a) CrowdPose: In [12], the CrowdPose dataset is pro-
previous frame. In our method, this is treated as a single pose posed. It is a selection of images from other datasets with
with 2 × 17 joints (CIF) and 18 associations (CAF) within the a particular emphasis on how crowded the images are. The
same frame and an additional 17 associations (TCAF) between crowd-index of an image represents the amount of overlap
frames. between person bounding boxes. The authors place particular
7

emphasis on a uniform distribution of the crowd-index in to (96 px)2 and for large instances with a bounding box area
all data partitions. Because this dataset is a composition of larger than (96 px)2 . For each image, pose estimators have to
other datasets and to avoid contamination, our CrowdPose provide the 17 keypoint locations per pose and a total score
models are pretrained on ImageNet [57] and then trained on for each pose. Only the top 20 scoring poses per image are
CrowdPose only. The dataset comes with a split of 10,000 considered for evaluation.
images for training, 2,000 for validation and 8,000 images for b) Pose Tracks: A common metric to evaluate the track-
the test set. ing of human poses is the Multi Object Tracker Accuracy
b) COCO: The de-facto standard for person keypoint (MOTA) [60], [4] which is also the main metric in PoseTrack
prediction is the competitive COCO keypoint task [2]. The test challenges and leaderboards. It combines false positives, false
set is private and powers an active leaderboard via a protected negatives and ID switches into a single metric. We compare
challenge server. COCO contains 56,599 diverse training im- against the best methods that submitted to the PoseTrack 2017
ages with person keypoint annotations. The validation and test- and 2018 evaluation server which computes all metrics on
dev sets contain 5,000 and 20,288 images. private test sets. These methods include strong top-down
c) ApolloCar3D: We generalize our approach to vehicle methods as well as bottom-up methods for pose estimation
keypoints using the ApolloCar3D dataset [58], which contains and tracking.
5,277 driving images at a resolution of 4K and over 60K car
instances. The authors defined 66 semantic keypoints in the
C. Implementation Details
dataset and, for each car, they provided annotations for the
visible ones. For clarity, we choose a subset of 24 semantic a) Neural Network Configuration: All our models are
keypoints and show quantitative and qualitative results on this based on ResNet [50] or ShuffleNetV2 [51] base networks
dataset. and multiple head networks. The base networks have their
d) Animal Dataset: We evaluate the performances of our input max-pooling operation removed as it destroys spatial
algorithm on the Animal-Pose Dataset [33], which provides information. The stride from input image to output feature
annotations for five categories of animals: dog, cat, cow, horse, map is 16 with 2048 features at each location. We apply
sheep for a total of 20 keypoints. The dataset includes 5,517 no additional modifications to the standard ResNet models.
instances in more than 3,000 images. The majority of these We use the standard building blocks of ShuffleNetV2 back-
images originally belong to the VOC dataset [59]. bones to construct our custom configurations which we denote
e) PoseTrack 2017 and 2018: We conduct quantitative ShuffleNetV2K16/K30. A ShuffleNetV2K16 model has the
studies of our tracking performance on the PoseTrack 2017 [9] prediction accuracy of a ResNet50 with fewer parameters than
and 2018 [13] datasets. The datasets contain short video a ResNet18. The configuration is specified by the number of
sequences of annotated and tracked human poses in diverse output features of the five stages and the number of repetitions
situations. The PoseTrack 2018 dataset contains 593 training of the blocks in each stage. Our ShuffleNetV2K16 has output
scenes, 170 validation scenes and 375 test scenes. The test features (block repeats) of 24 (1), 348 (4), 696 (8), 1392 (4),
labels are private. PoseTrack 2017 is a subset of the 2018 1392 (1) and our ShuffleNetV2K30 has 32 (1), 512 (8), 1024
dataset with 292 train, 50 validation and 208 test scenes. (16), 2048 (6), 2048 (1). Spatial 3×3 convolutions are replaced
However, the 2018 leaderboard is frozen and new results are with 5×5 convolutions which introduces only a small increase
only updated for the 2017 leaderboard. Therefore, many recent in the number of parameters because all spatial convolutions
methods present results on the older, smaller dataset. Here, we are depth-wise.
will report numbers for both 2017 and 2018. Each head network is a single 1×1 convolution followed by
a sub-pixel convolution [52] to double the spatial resolution
bringing the total stride down to eight. Therefore, the spatial
B. Evaluation feature map size for an input image of 801px×801px is 101×
a) Single-Image Multi-Person Poses: Both CrowdPose 101. The confidence component of a field is normalized with a
and COCO follow COCO’s keypoint evaluation method. The sigmoid non-linearity and the scale components for joint-sizes
object keypoint similarity (OKS) score [2] is used to assign are enforced to be positive with a softplus [61].
a bounding box to each keypoint as a function of the person b) Augmentations: We apply the standard augmentations
instance bounding box area. Similar to detection, the metric of random horizontal flipping, random rescaling with a rescal-
computes overlaps between ground truth and predicted bounding factor r ∈ [0.5, 2.0], random cropping and padding to
ing boxes to compute the standard detection metrics average 385 × 385 followed by color jittering with 40% variation in
precision (AP) and average recall (AR). brightness and saturation and 10% variation in hue. We also
CrowdPose breaks down the test set at the image level convert a random 1% of the images to grayscale and generate
into easy, medium and hard. The easy set contains images strong JPEG compression artifacts in 10% of the images.
with a crowd index in [0, 0.1], the medium set in [0.1, 0.8] The tracking task is similarly augmented. The random
and the hard set in [0.8, 1.0]. Given the uniform crowd-index rescaling is adapted to an image width in [0.5×801, 1.5×801]
distribution, most images of the test set are in the medium and random cropping to a maximum image side of 385 px.
category. Half of the image pairs are randomly reoriented (rotations by
COCO breaks down the precision scores at the instance level multiples of 90◦ ). To increase the inter-frame variations, we
for medium instances with a bounding box area of (32 px)2 add a small synthetic camera shift of maximum 30 px between
8

Fig. 9. Illustration of OpenPifPaf predictions from the CrowdPose [12] val set with crowd-index hard on a sports scene, a family photo and a street scene.

image pairs. To further increase the variation, we form image synthetic shifts of up to 30px. Starting from a trained single-
pairs with a random interval of 4, 8 and 12 frames. In 20% image pose backbone, we train on both datasets with SGD [62]
of image pairs, we replace one of the images with a random with the same configuration as for single images. We alternate
image to provide a higher number of negative samples for the dataset every batch and only do an SGD-step every two
tracking. batches. We train for 50 epochs where every epoch consists
c) Single-Image Training: For ResNet [50] backbones, of 4994 batches. The training time is 55 minutes per epoch
we use ImageNet [57] pretrained models. ShuffleNetV2 [51] on two V100 GPUs.
models are trained from random initializations. We use the
SGD [62] optimizer with Nesterov momentum [63] of 0.95, D. Results
batch size of 32 and weight decay of 10−5 . The learning rate
a) Crowded Single-Image Pose Estimation: In Figure 9,
is exponentially warmed up for one epoch from 10−3 of its
we show example pose predictions from the CrowdPose [12]
target value. At certain epochs (specified below), the learning
validation set. We show results in a diverse selection of sports
rate is exponentially decayed over 10 epochs by a factor of
disciplines and everyday settings. All shown images are from
10. We employ model averaging [64], [65] to extract stable
the hard subset with a crowd-index larger than 0.8.
models for validation. At each optimization step, we update
In Table I, we show a quantitative comparison of our
an exponentially weighted version of the model parameters
performance with other methods. We are not only more
with a decay constant of 10−2 .
precise across all precision metrics AP, AP0.50 , AP0.75 , APeasy ,
On CrowdPose, which is a smaller dataset than COCO, we APmedium and APhard but also predict faster than all previous
train for 300 epochs. We set the target learning rate to 10−5 top-performing methods at 13.7 FPS (frames-per-second) on
and decay at epochs 250 and 280. a single GTX1080Ti.
On COCO, we use a target learning rate of 10−4 and decay b) COCO: All state-of-the-art methods compare their
at epoch 130 and 140. The training time for 150 epochs of a performance on the well-established COCO keypoint task [2].
ShuffleNetV2K16 on two V100 is approximately 37 hours. We Our quantitative results on the private 2017 test-dev set are
do not use any additional datasets beyond the COCO keypoint shown in Table II along with other bottom-up methods.
annotations. This comparison includes field-based methods [16], [18], [11]
d) Training for Tracking on PoseTrack: We use the and methods based on associative embedding [17], [24]. We
ShuffleNetV2k30 backbone for all our tracking experiments. perform on par with the best existing bottom-up method. We
PoseTrack 2018 is a video dataset which means that despite evaluate on rescaled images where the longer edge is 801 px
a large number of annotations, the variation is smaller than in which is the same image size that will be used for tracking
single-image pose datasets. Therefore, we keep single-image below. We evaluate a single forward pass without horizontal
pose estimation on the COCO dataset [2] as an auxiliary task flipping and without multi-scale evaluation because we aim for
and train on PoseTrack and COCO simultaneously. The type a fast method. The average time per image with a GTX1080Ti
of poses that are annotated in the two datasets are similar is 152 ms (63 ms on a V100) of which 29 ms is used for
but not identical, e.g., one dataset annotates the eyes and the decoding.
other does not. During training, we alternate the two tasks c) Pose Tracking: We want to track multiple human
between batches. In one batch we feed pairs of images from poses in videos. We train and validate on the PoseTrack 2018
the PoseTrack dataset and apply losses to the corresponding dataset [13]. Table III shows our main results for pose tracking
head networks and in the next batch we feed in single images on both of the private test sets of Posetrack 2017 and 2018.
from COCO and apply losses to the other head networks (see We also show our single-image average precision (AP) which
Figure 5). The COCO task is trained identical to the single- highlights that our performant tracking method can compen-
image pose estimation discussed in the previous section, but sate for a lower AP, e.g., compared to MSRA/FlowTrack [19],
converted from single images to pairs of tracked images via and still outperform in overall MOTA and FPS. All our results
9

TABLE I
E VALUATION ON THE C ROWD P OSE TEST DATASET [12]. O UR O PEN P IF PAF RESULT IS BASED ON A R ES N ET 50 BACKBONE WITH SINGLE - SCALE
∗
EVALUATION AT 641 PX . VALUES EXTRACTED FROM C ROWD P OSE PAPER [12]. + E MPLOYS MULTI - SCALE TESTING .

AP AP0.50 AP0.75 APeasy APmedium APhard FPS

Mask R-CNN∗ [15] 57.2 83.5 60.3 69.4 57.9 45.8 2.9
AlphaPose∗ [66] 61.0 81.3 66.0 71.2 61.4 51.1 10.9
HigherHRNet-W48 [24] 65.9 86.4 70.6 73.3 66.5 57.9 -
SPPE [12] 66.0 84.2 71.5 75.5 66.3 57.4 10.1
HigherHRNet-W48+ [24] 67.6 87.4 72.6 75.8 68.1 58.9 -
OpenPifPaf (ours) 70.5 89.1 76.1 78.4 72.1 63.8 13.7

TABLE II
E VALUATION METRICS FOR THE COCO 2017 TEST- DEV DATASET FOR
BOTTOM - UP METHODS . N UMBERS ARE EXTRACTED FROM THE
RESPECTIVE PAPERS . O UR PREDICTION TIME IS DETERMINED ON A
SINGLE V100 GPU. ∗ O NLY EVALUATING IMAGES WITH THREE PERSON
INSTANCES .

AP APM APL t [ms]

OpenPose [16] 61.8 57.1 68.2 100
Assoc. Emb. [17] 65.5 60.6 72.6 166
PersonLab [18] 68.7 64.1 75.5 -
MultiPoseNet [23] 69.6 65.0 76.3 43∗ Fig. 10. Qualitative results from the Posetrack 2018 [13] validation set.
HigherHRNet [24] 70.5 66.6 75.8 >1000 Images show tracks of spatio-temporal poses including their frame-to-frame
OpenPifPaf (ours) 71.9 68.5 77.4 69 associations where only connections that were used to construct the poses are
shown.

TABLE III
E VALUATION METRICS ON THE TEST SETS OF ( A ) P OSE T RACK 2018 [13]
AND ( B ) P OSE T RACK 2017 [9]. N UMBERS ARE EXTRACTED FROM THE
RESPECTIVE PAPERS AND THE LEADERBOARD . A LL METHODS ARE
ONLINE METHODS APART FROM D ET T RACK [67].

PoseTrack 2018 MOTA FPS AP

openSVAI [28] 54.5 - 63.1
MIPAL [26] 54.9 - 67.8
Miracle [30] 57.4 - 70.9
MSRA/FlowTrack [19] 61.4 0.7 74.0
OpenPifPaf (ours) 61.7 12.2 71.9
(a)

PoseTrack 2017 MOTA FPS AP

STAF [27] 53.8 3 70.3 Fig. 11. Qualitative results from the Posetrack 2018 [13] validation
MIPAL [26] 54.5 - 68.8 set. Left: Single-image detection. The person’s left shoulder is not visible
MSRA/FlowTrack [19] 57.8 0.7 74.6 and therefore the left arm cannot be connected to the rest of the body.
HRNet [20] 57.9 - 74.9 Right: Spatio-temporal pose. Multiple temporal connections allow to safely
LightTrack [29] 58.0 - 66.6 connect both left and right arm to the rest of the body.
OpenPifPaf (ours) 60.6 12.2 71.5
KeyTrack [32] 61.2 1.0 74.0
DetTrack (offline) [67] 64.1 - 74.1
(b)
Similarly, for the poses 3, 4, 5 and 7 in Figure 12, the asso-
ciations from shoulders to hips are often difficult because of
the lighting condition. Depending on the predicted association
are produced in a single pass and online (without future confidences, the decoder determines automatically whether to
frames). The frames per second (FPS) stated in Table III refer connect to a keypoint with a spatial or temporal connection. In
to the single process, sequential evaluation. In addition, we these difficult scenarios, the greedy decoder completed these
provide extra metrics that are not published on the leader- poses with multiple temporal connections (TCAF).
boards. For PoseTrack 2017, our MOTP is 84.5, precision is d) Pedestrian, Car and Animal Poses: A holistic per-
84.1 and recall is 77.7. For PoseTrack 2018, our MOTP is ception framework for autonomous vehicles also needs to be
84.9, precision is 84.4 and recall is 78.3. able to generalize to other classes than humans. We show that
Spatio-temporal poses on real-world examples are shown in we can predict poses of cars and animals with high accuracy
Figure 10. They show challenging scenarios with occlusions. in Figures 13 and 14 and provide a quantitative summary in
Figure 11 highlights the ability of spatio-temporal poses to Table IV.
complete poses through time, i.e., even when a pose is On car instances, our model achieves an average preci-
partitioned because of occlusion in the current frame, multiple sion (AP) of 76.1%. The AP metric follows the same protocol
temporal connections (TCAF) form a single tracked pose. of human instances, but to the best of our knowledge no
10

This option would not be used in most real-world settings.

Without forcing complete poses, the decoding time and the
total prediction time is reduced by about 10ms.
a) Backbone: The reference backbone is a small Shuf-
fleNetV2K16. We show comparisons to the larger ResNet50
and ShuffleNetV2K30 backbones and show how they improve
precision (AP) and at what cost in timing.
b) Keypoint Criterium: We try to illuminate why our
precision and speed is significantly better than OpenPose [16].
Fig. 12. Qualitative results from the Posetrack 2018 [13] validation set. OpenPose first detects keypoints and then associates them.
(a) indicates a connection that has been made spatially in a previous frame Therefore, every keypoint has to be detectable individually.
but for the last few frames the left leg of person 3 is connected to the rest
of the body only through temporal connections. (b) shows a connection that In OpenPifPaf, new keypoint associations are generated from
is temporarily occluded by the arm of the person in front and also here our a source keypoint. These new keypoints are not previously
algorithm decided to connect the left leg via temporal connections instead of known. They are discovered in the association. That allows
spatial ones.
OpenPifPaf to generate poses from a strong seed keypoint and
connect to less confident keypoints. In “independent-only”, we
TABLE IV
Q UANTIFYING DETECTION PERFORMANCE FOR PEDESTRIANS , CARS AND restrict the keypoints of OpenPifPaf to be all of the quality of
ANIMALS . I N THE “P EDESTRIANS ” COLUMN , WE SHOW THE DETECTION an independent seed keypoint and observe a dramatic drop of
RATE ON KITTI [68] WITH I O U=0.3 AND INSTANCE THRESHOLD OF 0.2 8.1% in AP.
FOR ALL METHODS . F OR “V EHICLES ”, WE SHOW THE KEYPOINT
DETECTION RATE ON A POLLO C AR 3D [58] WHICH WAS PUBLISHED IN c) Frontier Decoder: Next, we study the impact of the
PREVIOUS METHODS AND WE ALSO PROVIDE AP IN THE TEXT. I N THE Frontier decoder with respect to a simpler decoder without
“A NIMALS ” COLUMN , WE PROVIDE KEYPOINT AP AS DEFINED IN THE frontier. The standard pose is sparsely connected and, there-
A NIMAL -P OSE DATASET [33].
fore, the frontier only has few alternatives to prioritize. For
Method Pedestrians Vehicles Animals a denser pose (“dense”), the impact of the frontier (compare
Mono3D [69] 73.2 - - with “no-frontier and dense”) is more pronounced (+0.3 AP).
3DOP (stereo) [70] 73.1 - -
MonoDIS [71] 60.5 - -
d) Memory Efficient Decoding: In the bottom part of
SMOKE [72] 39.1 - - Table V, we study the effect of removing the high-resolution
MonoPSR [73] 82.8 - - accumulation map (HR) to reduce the memory footprint. This
CPM [21] - 75.4 -
WS-CDA [33] - - 44.3
high resolution map is used in two places. First, to rescore
OpenPifPaf (ours) 84.6 86.1 47.8 the seeds and, second, to rescore the CAF. The impact of the
Human labelers [58] - 92.4 - seed rescoring is only 0.1 in AP but comes at a large cost
in decoding time. As an alternative, we investigate a local
non-maximum suppression (NMS) that selects a seed only if
previous method has evaluated AP on ApolloCar3D [58] it is the highest confidence in a 3 × 3 window (introduced
without leveraging 3D information. Hence, we include a study in CenterNet [74]). This NMS reduces the decoding time
on the keypoint detection rate, which has been defined in but not back to the original speed. Independently, we study
the ApolloCar3D dataset [58] and considers a keypoint to be the impact of rescoring the CAF field which is about +1.0%
correctly estimated if the error is less than 10 pixels. Our in AP. Only when both the seed rescoring and the CAF
method achieves a detection rate of 86.1% compared to 75.4% rescoring are removed, the creation of the HR maps can
of CPM [21]. Notably, the authors of ApolloCar3D [58] also be omitted. In that memory efficient configuration (bottom
report the detection rate of the human labelers to be 92.4%. line in Table V), the AP dropped by 1.4% with respect to
On animal instances, our model achieves an AP of 47.8%,
“original”. This demonstrates the importance of the high-
compared to 44.3% of WS-CDA, the baseline developed by the
resolution accumulation for speed and accuracy and which
authors of the Animal-Pose dataset [33]. Lower performances
should only be removed when absolutely necessary.
on animals are due to the smaller dataset size with just
e) Tracking Baselines: We conducted detailed studies of
4K training instances. Simultaneous training for humans and
our method on the Posetrack 2018 validation set that are shown
animals to achieve better generalization is left for future work.
in Table VI. First, we created two baselines ourselves. Both
baselines first do single-image pose estimation and then use the
E. Ablation Studies Hungarian algorithm [75] to track poses from frame to frame.
We study the impact of the backbone, the precise criteria for Our first algorithm uses a simple Euclidean distance between
a keypoint, our proposed Frontier decoder, a memory efficient joints to construct a pose similarity score. Our second method
decoder, alternatives to TCAF and the impact of input image replaces the Euclidean distance with an OKS-based distance
size. We start with studies for single images on the COCO val that is used in the COCO metric to compare predictions to
set (Table V) before moving to tracking studies for PoseTrack ground truth. Both methods show a drop in MOTA of 1.5 and
(Table VI). 2.0 while operating at about the same speed as our “original”
Our single-image studies are run with an option to force model. This demonstrates that the overhead of our tracking
complete poses. This is the common practice as the COCO network is comparable to the small overhead of the Hungarian
metric does not penalize false positive keypoints within poses. algorithm with respect to the single-image model.
11

Fig. 13. Qualitative results from the KITTI [68] and ApolloCar3D [58] datasets. We resolve distant pedestrians, cyclists and cars and handle changing
lighting conditions well.

TABLE V
A BLATION STUDIES OF SKELETON CHOICE AND DECODER CONFIGURATIONS FOR SINGLE - IMAGE POSE ESTIMATION . A LL RESULTS ( EXCEPT WHERE
EXPLICITLY STATED OTHERWISE ) ARE PRODUCED WITH THE SAME S HUFFLE N ET V2 K 16 MODEL ON THE COCO VAL SET [2] ON A SINGLE GTX1080T I .
F IRST, WE REVIEW DIFFERENT BACKBONE ARCHITECTURES ( A R ES N ET 50 [50] AND A LARGER S HUFFLE N ET V2 [51]). S ECOND , WE SHOW THAT ONLY
USING CONFIDENT KEYPOINTS LEADS TO A LARGE DROP IN PRECISION . T HIRD , WE OBSERVE THAT THE F RONTIER DECODER IS MORE IMPORTANT FOR
DENSER SKELETONS WHILE INCURRING ALMOST NO OVERHEAD ON SPARSE SKELETONS . F OURTH , WE CAN PRODUCE A MEMORY- EFFICIENT VERSION
OF OUR DECODER AT A COST OF 1.4% IN AP. T HE BIGGEST DROP IN ACCURACY COMES FROM NOT RESCORING THE CAF FIELD AND THE LARGEST
CONTRIBUTOR TO INCREASING THE INFERENCE TIME IS NOT RESCORING THE SEEDS .

AP AP0.50 AP0.75 APM APL t [ms] tdec [ms]

original (ShuffleNetV2K16) 66.8 86.5 73.2 62.1 74.6 50 19
Backbone ResNet50 68.2 87.9 74.6 65.8 72.7 64 22
ShuffleNetV2K30 71.0 88.8 77.7 66.6 78.5 92 16
Keypoints independent-only -8.1 -6.3 -9.5 -8.7 -7.3 ±0 ±0
Frontier decoder no-frontier ±0.0 -0.1 +0.1 ±0.0 -0.1 -1 -1
dense +0.1 +0.2 +0.2 -0.3 +0.5 +15 +15
no-frontier and dense -0.3 +0.1 -0.1 -0.5 ±0.0 +14 +14
memory efficient no seed rescoring -0.1 -0.4 -0.1 +0.2 +0.1 +71 +54
no seed rescoring (with NMS) +0.1 +0.1 ±0.0 +0.2 +0.0 +19 +15
no CAF rescoring -1.0 -0.3 -1.0 -1.0 -1.7 -1 -1
no rescoring (with NMS), without HR -1.4 -0.4 -1.4 -1.0 -2.3 +9 +7
12

R EFERENCES
[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human
pose estimation: New benchmark and state of the art analysis,” in IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), June
2014. 1
[2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in con-
text,” in European Conference on Computer Vision (ECCV). Springer,
2014, pp. 740–755. 1, 2, 5, 6, 7, 8, 11
[3] A. Crow, “How safe are self-driving cars?” Rocky Mountain Institute,
5 2017. [Online]. Available: https://rmi.org/safe-self-driving-cars/ 1
Fig. 14. Qualitative results from the Animal-Pose dataset [33]. The left [4] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “Mot16: A
image was processed by a person model and an animal model. benchmark for multi-object tracking,” arXiv preprint arXiv:1603.00831,
2016. 1, 7
[5] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fer-
TABLE VI nandez, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder, “The visual
BASELINES AND ABLATION STUDIES ON THE P OSE T RACK 2018 object tracking vot2015 challenge results,” in Proceedings of the IEEE
VALIDATION SET [13] ON A SINGLE V100 GPU. W E OUTPERFORM international conference on computer vision workshops, 2015, pp. 1–23.
H UNGARIAN TRACKERS WITH EUCLIDEAN AND OKS DISTANCE 1
FUNCTIONS IN ACCURACY FOR A SMALL OVERHEAD IN FPS. W E ALSO [6] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Ce-
STUDY OUR SENSITIVITY TO THE INPUT IMAGE SIZE . F OR IMAGE SIZES OF hovin Zajc, T. Vojir, G. Hager, A. Lukezic, A. Eldesokey et al., “The
513 PX , WE OBSERVE A DROP OF 2.9 IN MOTA BUT RUN 82% FASTER AT visual object tracking vot2017 challenge results,” in Proceedings of the
22.2 FPS. IEEE international conference on computer vision workshops, 2017, pp.
1949–1972. 1
MOTA FPS [7] B. D. Lucas, T. Kanade et al., “An iterative image registration technique
original (801px) 66.4 12.2 with an application to stereo vision,” 1981. 1
Hungarian euclidean -1.5 +4% [8] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov,
OKS -2.0 +1% B. Andres, and B. Schiele, “Arttrack: Articulated multi-person tracking
Image size 513px -2.9 +82% in the wild,” in Conference on Computer Vision and Pattern Recognition
641px -0.9 +37% (CVPR), 2017, pp. 6457–6465. 1, 2
1201px -1.7 -49% [9] U. Iqbal, A. Milan, and J. Gall, “Posetrack: Joint multi-person pose
estimation and tracking,” in Conference on Computer Vision and Pattern
Recognition (CVPR), 2017, pp. 2011–2020. 1, 2, 6, 7, 9
[10] A. Doering, U. Iqbal, and J. Gall, “Joint flow: Temporal flow fields for
f) Tracking Ablation: We studied the effect of input multi person tracking,” arXiv preprint arXiv:1805.04596, 2018. 1, 2, 6
image size at the bottom of Table VI. Our “original” model [11] S. Kreiss, L. Bertoni, and A. Alahi, “Pifpaf: Composite fields for
human pose estimation,” in Conference on Computer Vision and Pattern
rescales the image width to 801px. Larger images do not Recognition (CVPR), June 2019. 2, 3, 4, 5, 8
show an improvement in accuracy (MOTA) while becoming [12] J. Li, C. Wang, H. Zhu, Y. Mao, H.-S. Fang, and C. Lu, “Crowdpose:
significantly slower. Smaller input images decrease MOTA Efficient crowded scenes pose estimation and a new benchmark,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
but at the same time can drastically increase speed. Most Recognition, 2019, pp. 10 863–10 872. 2, 6, 8, 9
applications can probably tolerate an accuracy reduction by [13] M. Andriluka, U. Iqbal, E. Insafutdinov, L. Pishchulin, A. Milan, J. Gall,
0.9 in MOTA to improve speed by +37%. When the input and B. Schiele, “Posetrack: A benchmark for human pose estimation and
tracking,” in Conference on Computer Vision and Pattern Recognition
image size is reduced to 513px, MOTA drops by 2.9 (still a (CVPR), 2018, pp. 5167–5176. 2, 6, 7, 8, 9, 10, 12
great result) which comes with a speed improvement of +82% [14] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via
to a fast 22.2 FPS. deep neural networks,” in Conference on Computer Vision and Pattern
Recognition (CVPR), 2014, pp. 1653–1660. 2
[15] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in
VI. C ONCLUSIONS Computer Vision (ICCV), 2017 IEEE International Conference on.
IEEE, 2017, pp. 2980–2988. 2, 9
We have demonstrated a new method for bottom-up pose [16] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d
tracking for 2D human poses and shown its strength in pose estimation using part affinity fields,” in Conference on Computer
Vision and Pattern Recognition (CVPR), 2017, pp. 7291–7299. 2, 3, 4,
crowded and occluded scenes that are relevant for perception 8, 9, 10
in self-driving cars and social robots. We outperform previous [17] A. Newell, Z. Huang, and J. Deng, “Associative embedding: End-to-
state-of-the-art methods on CrowdPose and on PoseTrack2018. end learning for joint detection and grouping,” in Advances in Neural
Information Processing Systems, 2017, pp. 2277–2287. 2, 8, 9
On PoseTrack2017 we are on par with the state-of-the-art but [18] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Mur-
run an order of magnitude faster. We have also shown that our phy, “Personlab: Person pose estimation and instance segmentation with
method generalizes to pose estimation of cars and animals. a bottom-up, part-based, geometric embedding model,” in European
Conference on Computer Vision (ECCV), 2018, pp. 269–286. 2, 3, 5,
We can run all versions simultaneously on an image sequence 6, 8, 9
and form the union of the predictions. In the future, we can [19] B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose
investigate shared backbone architectures to create a holistic estimation and tracking,” in Proceedings of the European conference
on computer vision (ECCV), 2018, pp. 466–481. 2, 8, 9
perception framework for autonomous vehicles. [20] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution represen-
tation learning for human pose estimation,” in Conference on Computer
VII. ACKNOWLEDGEMENTS Vision and Pattern Recognition (CVPR), 2019, pp. 5693–5703. 2, 9
[21] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional
This work was supported by the Swiss National Science pose machines,” in Conference on Computer Vision and Pattern Recog-
Foundation under the Grant 2OOO21-L92326 and the SNSF nition (CVPR), 2016, pp. 4724–4732. 2, 10
[22] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for
Spark fund (190677). We also thank our lab members and human pose estimation,” in European Conference on Computer Vision
reviewers for their valuable comments. (ECCV). Springer, 2016, pp. 483–499. 2
13

[23] M. Kocabas, S. Karagoz, and E. Akbas, “Multiposenet: Fast multi-person [44] S. Zuffi, A. Kanazawa, T. Berger-Wolf, and M. Black, “Three-d safari:
pose estimation using pose residual network,” in European Conference Learning to estimate zebra pose, shape, and texture from images “in
on Computer Vision (ECCV), 2018, pp. 417–433. 2, 9 the wild”,” in Proceedings of the IEEE International Conference on
[24] B. Cheng, B. Xiao, J. Wang, H. Shi, T. S. Huang, and L. Zhang, Computer Vision (ICCV), 2019, pp. 5358–5367. 2
“Higherhrnet: Scale-aware representation learning for bottom-up human [45] N. Dinesh Reddy, M. Vo, and S. G. Narasimhan, “Carfusion: Combining
pose estimation,” in Proceedings of the IEEE/CVF Conference on point tracking and part detection for dynamic 3d reconstruction of
Computer Vision and Pattern Recognition, 2020, pp. 5386–5395. 2, vehicles,” in Proceedings of the IEEE Conference on Computer Vision
8, 9 and Pattern Recognition (CVPR), 2018, pp. 1906–1915. 2
[25] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. [46] L. Ke, S. Li, Y. Sun, Y.-W. Tai, and C.-K. Tang, “Gsnet: Joint vehicle
Gehler, and B. Schiele, “Deepcut: Joint subset partition and labeling for pose and shape reconstruction with geometrical and scene-aware super-
multi person pose estimation,” in Conference on Computer Vision and vision,” in European Conference on Computer Vision (ECCV). Springer,
Pattern Recognition (CVPR), 2016, pp. 4929–4937. 2 2020, pp. 515–532. 2
[26] J. Hwang, J. Lee, S. Park, and N. Kwak, “Pose estimator and tracker [47] Z. Cao, G. H. Martinez, T. Simon, S.-E. Wei, and Y. A. Sheikh,
using temporal flow maps for limbs,” in 2019 International Joint “Openpose: realtime multi-person 2d pose estimation using part affinity
Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8. 2, fields,” IEEE transactions on pattern analysis and machine intelligence,
6, 9 2019. 2
[27] Y. Raaj, H. Idrees, G. Hidalgo, and Y. Sheikh, “Efficient online multi- [48] H. C. Sánchez, A. H. Martı́nez, R. I. Gonzalo, N. H. Parra, I. P. Alonso,
person 2d pose tracking with recurrent spatio-temporal affinity fields,” and D. Fernandez-Llorca, “Simple baseline for vehicle pose estimation:
in Conference on Computer Vision and Pattern Recognition (CVPR), Experimental validation,” IEEE Access, vol. 8, pp. 132 539–132 550,
2019, pp. 4620–4628. 2, 9 2020. 2
[28] G. Ning, P. Liu, X. Fan, and C. Zhang, “A top-down approach to ar- [49] Y. Xiang, R. Mottaghi, and S. Savarese, “Beyond pascal: A benchmark
ticulated human pose estimation and tracking,” in European Conference for 3d object detection in the wild,” in Proceeding of the IEEE Winter
on Computer Vision (ECCV), 2018, pp. 0–0. 2, 9 Conference on Applications of Computer Vision (WACV). IEEE, 2014,
[29] G. Ning, J. Pei, and H. Huang, “Lighttrack: A generic framework for pp. 75–82. 2
online top-down human pose tracking,” in Proceedings of the IEEE/CVF [50] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
Conference on Computer Vision and Pattern Recognition Workshops, recognition,” in Conference on Computer Vision and Pattern Recognition
2020, pp. 1034–1035. 2, 9 (CVPR), 2016, pp. 770–778. 4, 7, 8, 11
[30] D. Yu, K. Su, J. Sun, and C. Wang, “Multi-person pose estimation for [51] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical
pose tracking with enhanced cascaded pyramid network,” in European guidelines for efficient cnn architecture design,” in European Conference
Conference on Computer Vision (ECCV), 2018. 2, 9 on Computer Vision (ECCV), 2018, pp. 116–131. 4, 7, 8, 11
[31] S. Jin, W. Liu, W. Ouyang, and C. Qian, “Multi-person articulated [52] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop,
tracking with spatial and temporal embeddings,” in Conference on D. Rueckert, and Z. Wang, “Real-time single image and video super-
Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5664– resolution using an efficient sub-pixel convolutional neural network,” in
5673. 2 Conference on Computer Vision and Pattern Recognition (CVPR), 2016,
[32] M. Snower, A. Kadav, F. Lai, and H. P. Graf, “15 keypoints is all you pp. 1874–1883. 4, 7
need,” arXiv preprint arXiv:1912.02323, 2019. 2, 9 [53] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler,
[33] J. Cao, H. Tang, H.-S. Fang, X. Shen, C. Lu, and Y.-W. Tai, “Cross- and K. Murphy, “Towards accurate multi-person pose estimation in
domain adaptation for animal pose estimation,” in Proceedings of the the wild,” in Conference on Computer Vision and Pattern Recognition
IEEE International Conference on Computer Vision (ICCV), 2019, pp. (CVPR), vol. 3, no. 4, 2017, p. 6. 3
9498–9507. 2, 7, 10, 12 [54] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss
[34] N. D. Reddy, M. Vo, and S. G. Narasimhan, “Occlusion-net: 2d/3d for dense object detection,” in Proceedings of the IEEE international
occluded keypoint localization using graph networks,” in Proceedings conference on computer vision (ICCV), 2017, pp. 2980–2988. 5
of the IEEE Conference on Computer Vision and Pattern Recognition [55] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian
(CVPR), 2019, pp. 7326–7335. 2 deep learning for computer vision?” in Advances in neural information
[35] T. D. Pereira, D. E. Aldarondo, L. Willmore, M. Kislin, S. S.-H. Wang, processing systems, 2017, pp. 5574–5584. 5
M. Murthy, and J. W. Shaevitz, “Fast animal pose estimation using deep [56] V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveendran, and
neural networks,” Nature methods, vol. 16, no. 1, pp. 117–125, 2019. 2 M. Grundmann, “Blazeface: Sub-millisecond neural face detection on
[36] B. Biggs, O. Boyne, J. Charles, A. Fitzgibbon, and R. Cipolla, “Who left mobile gpus,” arXiv preprint arXiv:1907.05047, 2019. 5
the dogs out? 3d animal reconstruction with expectation maximization [57] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:
in the loop,” in European Conference on Computer Vision (ECCV). A large-scale hierarchical image database,” in Computer Vision and
Springer, 2020, pp. 195–211. 2 Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE,
[37] S. Li, J. Li, H. Tang, R. Qian, and W. Lin, “Atrw: A benchmark for 2009, pp. 248–255. 7, 8
amur tiger re-identification in the wild,” in Proceedings of the 28th [58] X. Song, P. Wang, D. Zhou, R. Zhu, C. Guan, Y. Dai, H. Su, H. Li, and
ACM International Conference on Multimedia. New York, NY, USA: R. Yang, “Apollocar3d: A large 3d car instance understanding bench-
Association for Computing Machinery, 2020, p. 2590–2598. 2 mark for autonomous driving,” in Proceedings of the IEEE Conference
[38] A. Mathis, M. Yüksekgönül, B. Rogers, M. Bethge, and M. W. Mathis, on Computer Vision and Pattern Recognition, 2019, pp. 5452–5462. 7,
“Pretraining boosts out-of-domain robustness for pose estimation,” in 10, 11
Proceeding of the IEEE Winter Conference on Applications of Computer [59] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn,
Vision (WACV), 2021. 2 and A. Zisserman, “The pascal visual object classes challenge: A
[39] A. Mathis, P. Mamidanna, K. M. Cury, T. Abe, V. N. Murthy, M. W. retrospective,” International journal of computer vision, vol. 111, no. 1,
Mathis, and M. Bethge, “Deeplabcut: markerless pose estimation of user- pp. 98–136, 2015. 7
defined body parts with deep learning,” Nature Publishing Group, Tech. [60] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking
Rep., 2018. 2 performance: the clear mot metrics,” EURASIP Journal on Image and
[40] J. Mu, W. Qiu, G. D. Hager, and A. L. Yuille, “Learning from synthetic Video Processing, vol. 2008, pp. 1–10, 2008. 7
animals,” in Proceedings of the IEEE Conference on Computer Vision [61] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia, “Incor-
and Pattern Recognition (CVPR), 2020, pp. 12 386–12 395. 2 porating second-order functional knowledge for better option pricing,”
[41] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, Advances in neural information processing systems, vol. 13, pp. 472–
“Smpl: A skinned multi-person linear model,” ACM transactions on 478, 2000. 7
graphics (TOG), vol. 34, no. 6, pp. 1–16, 2015. 2 [62] L. Bottou, “Large-scale machine learning with stochastic gradient de-
[42] S. Zuffi, A. Kanazawa, and M. J. Black, “Lions and tigers and bears: scent,” in Proceedings of COMPSTAT’2010. Springer, 2010, pp. 177–
Capturing non-rigid, 3d, articulated shape from images,” in Proceedings 186. 8
of the IEEE Conference on Computer Vision and Pattern Recognition [63] Y. Nesterov, “A method of solving a convex programming problem with
(CVPR), 2018, pp. 3955–3963. 2 convergence rate o(1/k2),” in Soviet Mathematics Doklady, vol. 27, no. 2,
[43] B. Biggs, T. Roddick, A. Fitzgibbon, and R. Cipolla, “Creatures great 1983, pp. 372–376. 8
and smal: Recovering the shape and motion of animals from video,” in [64] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approxima-
Asian Conference on Computer Vision (ACCV). Springer, 2018, pp. tion by averaging,” SIAM journal on control and optimization, vol. 30,
3–19. 2 no. 4, pp. 838–855, 1992. 8
14

[65] D. Ruppert, “Efficient estimations from a slowly convergent robbins- Alexandre Alahi is an Assistant Professor at EPFL.
monro process,” Cornell University Operations Research and Industrial He spent five years at Stanford University as a
Engineering, Tech. Rep., 1988. 8 Post-doc and Research Scientist after obtaining his
[66] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “Rmpe: Regional multi- Ph.D. from EPFL. His research enables machines to
person pose estimation,” in International Conference on Computer perceive the world and make decisions in the context
Vision (ICCV), 2017, pp. 2334–2343. 9 of transportation problems and smart environments.
[67] M. Wang, J. Tighe, and D. Modolo, “Combining detection and tracking He has worked on the theoretical challenges and
for human pose estimation in videos,” in Proceedings of the IEEE/CVF practical applications of socially-aware Artificial In-
Conference on Computer Vision and Pattern Recognition, 2020, pp. telligence, i.e., systems equipped with perception
11 088–11 096. 9 and social intelligence. He was awarded the Swiss
[68] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: NSF early and advanced researcher grants for his
The kitti dataset,” International Journal of Robotics Research (IJRR), work on predicting human social behavior. Alexandre has also co-founded
2013. 10, 11 multiple startups such as Visiosafe, and won several startup competitions. He
[69] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d object was elected as one of the Top 20 Swiss Venture leaders in 2010.
detection network for autonomous driving,” in Proceedings of the IEEE
conference on Computer Vision and Pattern Recognition, 2017, pp.
1907–1915. 10
[70] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and
R. Urtasun, “3d object proposals for accurate object class detection,” in
Advances in Neural Information Processing Systems. Citeseer, 2015,
pp. 424–432. 10
[71] A. Simonelli, S. R. Bulo, L. Porzi, M. López-Antequera, and
P. Kontschieder, “Disentangling monocular 3d object detection,” in
Proceedings of the IEEE/CVF International Conference on Computer
Vision, 2019, pp. 1991–1999. 10
[72] Z. Liu, Z. Wu, and R. Tóth, “Smoke: Single-stage monocular 3d object
detection via keypoint estimation,” in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition Workshops,
2020, pp. 996–997. 10
[73] J. Ku, A. D. Pon, and S. L. Waslander, “Monocular 3d object detection
leveraging accurate proposals and shape reconstruction,” in Proceedings
of the IEEE/CVF conference on computer vision and pattern recognition,
2019, pp. 11 867–11 876. 10
[74] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv
preprint arXiv:1904.07850, 2019. 10
[75] H. W. Kuhn, “The hungarian method for the assignment problem,” Naval
research logistics quarterly, vol. 2, no. 1-2, pp. 83–97, 1955. 10

Sven Kreiss is a postdoc at the Visual Intelligence

for Transportation (VITA) lab at EPFL in Switzer-
land focusing on perception with composite fields.
Before returning to academia, he was the Senior
Data Scientist at Sidewalk Labs (Alphabet, Google
sister) and worked on geospatial machine learning
for urban environments. Prior to his industry expe-
rience, Sven developed statistical tools and methods
used in particle physics research.

Lorenzo Bertoni is a doctoral student at the Visual

Intelligence for Transportation (VITA) lab at EPFL
in Switzerland focusing on 3D vision for vulnerable
road users. Before joining EPFL, Lorenzo was a
management consultant at Oliver Wyman and a visit-
ing researcher at the University of California, Berke-
ley, working on predictive control of autonomous
vehicles. Lorenzo received Bachelors and Masters
Degrees in Engineering from the Polytechnic Uni-
versity of Turin and the University of Illinois at
Chicago.

Ikea Campaign
No ratings yet
Ikea Campaign
7 pages
Toaz - Info Chapter 10 SB Answers PR
No ratings yet
Toaz - Info Chapter 10 SB Answers PR
18 pages
C#.NET 17.0 Material - Harsha
100% (1)
C#.NET 17.0 Material - Harsha
1,266 pages
API 653 Vol 1
92% (12)
API 653 Vol 1
128 pages
Lemke - Biopolitics
No ratings yet
Lemke - Biopolitics
80 pages
oedestrian
No ratings yet
oedestrian
6 pages
Aiav Unit 2 Notes
No ratings yet
Aiav Unit 2 Notes
8 pages
Fast and Furious
No ratings yet
Fast and Furious
9 pages
Advanced Topics in Autonomous Driving Using Deep Learning: Presenter: Nasim Souly
No ratings yet
Advanced Topics in Autonomous Driving Using Deep Learning: Presenter: Nasim Souly
41 pages
Hu Joint Monocular 3D Vehicle Detection and Tracking ICCV 2019 Paper
No ratings yet
Hu Joint Monocular 3D Vehicle Detection and Tracking ICCV 2019 Paper
10 pages
Joint Monocular 3D Vehicle Detection and Tracking
No ratings yet
Joint Monocular 3D Vehicle Detection and Tracking
18 pages
s10489-023-04998-3 (1)
No ratings yet
s10489-023-04998-3 (1)
19 pages
1602 00763
No ratings yet
1602 00763
5 pages
Kreiss PifPaf Composite Fields For Human Pose Estimation CVPR 2019 Paper
No ratings yet
Kreiss PifPaf Composite Fields For Human Pose Estimation CVPR 2019 Paper
10 pages
Trackformer
No ratings yet
Trackformer
16 pages
Object Detection and Segmentation On Tensor Flow Using
No ratings yet
Object Detection and Segmentation On Tensor Flow Using
10 pages
Deep Object Pose Estimation For Semantic Robotic Grasping of Household Objects
No ratings yet
Deep Object Pose Estimation For Semantic Robotic Grasping of Household Objects
11 pages
Key Track
No ratings yet
Key Track
11 pages
Research Paper Zebra Pose
No ratings yet
Research Paper Zebra Pose
16 pages
2018 Ijcv Visual Compiler
No ratings yet
2018 Ijcv Visual Compiler
19 pages
2207.04551v2
No ratings yet
2207.04551v2
38 pages
Detect To Track and Track To Detect
No ratings yet
Detect To Track and Track To Detect
10 pages
2412.04592v1
No ratings yet
2412.04592v1
13 pages
Literature Survey For Robotics
No ratings yet
Literature Survey For Robotics
6 pages
11
No ratings yet
11
19 pages
Object Recognition and Detection With Deep Learning For Autonomous Driving Applications
No ratings yet
Object Recognition and Detection With Deep Learning For Autonomous Driving Applications
11 pages
Real-Time Object Detection Using Deep Learning: Journal of Advances in Mathematics and Computer Science June 2023
No ratings yet
Real-Time Object Detection Using Deep Learning: Journal of Advances in Mathematics and Computer Science June 2023
10 pages
Electronics 10 03159 v2
No ratings yet
Electronics 10 03159 v2
22 pages
Mc Track
No ratings yet
Mc Track
14 pages
CAPformer Pedestrian Crossing Action Prediction Us
No ratings yet
CAPformer Pedestrian Crossing Action Prediction Us
22 pages
3D Visual Tracking of Articulated Objects and Hands
No ratings yet
3D Visual Tracking of Articulated Objects and Hands
16 pages
RTtracking DeepAssociationMetric
No ratings yet
RTtracking DeepAssociationMetric
5 pages
CORT: Class-Oriented Real-Time Tracking For Embedded Systems
No ratings yet
CORT: Class-Oriented Real-Time Tracking For Embedded Systems
10 pages
Blazepose: On-Device Real-Time Body Pose Tracking
No ratings yet
Blazepose: On-Device Real-Time Body Pose Tracking
4 pages
Social Distance
No ratings yet
Social Distance
18 pages
End-To-End Contextual Perception and Prediction With Interaction Transformer
No ratings yet
End-To-End Contextual Perception and Prediction With Interaction Transformer
8 pages
EScholarship UC Item 3rd9150m
No ratings yet
EScholarship UC Item 3rd9150m
128 pages
1.mot Ijsae
No ratings yet
1.mot Ijsae
10 pages
Ijet V4i3p31 PDF
No ratings yet
Ijet V4i3p31 PDF
5 pages
Pfister15 PHD Thesis PDF
No ratings yet
Pfister15 PHD Thesis PDF
220 pages
Object Detection: Advances, Applications, and Algorithms
From Everand
Object Detection: Advances, Applications, and Algorithms
Fouad Sabry
No ratings yet
paper for bibliometric analysis_occlusion
No ratings yet
paper for bibliometric analysis_occlusion
3 pages
Electronics 13 02790
No ratings yet
Electronics 13 02790
15 pages
Girdhar Detect-and-Track Efficient Pose CVPR 2018 Paper
No ratings yet
Girdhar Detect-and-Track Efficient Pose CVPR 2018 Paper
10 pages
Sha STA
No ratings yet
Sha STA
10 pages
Multi Object Tracking in Traffic Environments: A Systematic Literature
No ratings yet
Multi Object Tracking in Traffic Environments: A Systematic Literature
13 pages
IEEE _ MODAT_Sairaj
No ratings yet
IEEE _ MODAT_Sairaj
4 pages
Simple Online and Realtime Tracking With Occlusion Handling
No ratings yet
Simple Online and Realtime Tracking With Occlusion Handling
9 pages
Electronics 13 00967 v4
No ratings yet
Electronics 13 00967 v4
16 pages
1-s2.0-S0141938225001064-main
No ratings yet
1-s2.0-S0141938225001064-main
26 pages
I2c-Net Using Instance-Level Neural Networks For M
No ratings yet
I2c-Net Using Instance-Level Neural Networks For M
8 pages
IEEE Xplore Reference Download 2024.8.24.11.19.15
No ratings yet
IEEE Xplore Reference Download 2024.8.24.11.19.15
2 pages
Fairmot Explained 1
No ratings yet
Fairmot Explained 1
19 pages
Object Detection Using Tensorflow....
No ratings yet
Object Detection Using Tensorflow....
9 pages
Detect-And-Track Efficient Pose Estimation in Videos
No ratings yet
Detect-And-Track Efficient Pose Estimation in Videos
10 pages
Final Report - Removed
No ratings yet
Final Report - Removed
43 pages
CVPRW 2022 Aicity
No ratings yet
CVPRW 2022 Aicity
9 pages
Tag Draft Especializado
No ratings yet
Tag Draft Especializado
14 pages
Detection New
No ratings yet
Detection New
13 pages
Pedestrian Detection and Tracking
No ratings yet
Pedestrian Detection and Tracking
13 pages
Sensors 22 04833
No ratings yet
Sensors 22 04833
17 pages
Articulated Body Pose Estimation: Unlocking Human Motion in Computer Vision
From Everand
Articulated Body Pose Estimation: Unlocking Human Motion in Computer Vision
Fouad Sabry
No ratings yet
2802 8020 1 PB
No ratings yet
2802 8020 1 PB
3 pages
5 Major Computervision Technique
No ratings yet
5 Major Computervision Technique
10 pages
GPT-4o
No ratings yet
GPT-4o
5 pages
Statistics YTU Day 1_70c47b3d-23fd-4707-8184-60cbab30a3c3
No ratings yet
Statistics YTU Day 1_70c47b3d-23fd-4707-8184-60cbab30a3c3
37 pages
File 2
No ratings yet
File 2
1 page
Efectos de Aplicaciones de Bioestimulantes en El Rendimiento y La Calidad Del Cultivo de Frutilla o Fresa
No ratings yet
Efectos de Aplicaciones de Bioestimulantes en El Rendimiento y La Calidad Del Cultivo de Frutilla o Fresa
16 pages
Disjunktni Skupovi
No ratings yet
Disjunktni Skupovi
44 pages
Mora, Priscila - Youth Innovation and Entrepreneurship As Drivers of Economic Development and Growth in The Americas
No ratings yet
Mora, Priscila - Youth Innovation and Entrepreneurship As Drivers of Economic Development and Growth in The Americas
11 pages
Performance Assessment and Review Admin and Accounts Managers
No ratings yet
Performance Assessment and Review Admin and Accounts Managers
3 pages
Saudi Aramco Test Report Pre-Test Punch List Form: Testing Mech-SATR-A-2007
100% (2)
Saudi Aramco Test Report Pre-Test Punch List Form: Testing Mech-SATR-A-2007
1 page
AUC To Fixed Asset Settlement
No ratings yet
AUC To Fixed Asset Settlement
39 pages
Brand Sense 2
No ratings yet
Brand Sense 2
35 pages
_ Rilco- Calcium silicate Insulation
No ratings yet
_ Rilco- Calcium silicate Insulation
5 pages
How I Learned That the Problem in My Marriage Was Me - The New York Times
No ratings yet
How I Learned That the Problem in My Marriage Was Me - The New York Times
14 pages
RES 16-MV SWGR Submittal Accmi Project ABB Rev.01
No ratings yet
RES 16-MV SWGR Submittal Accmi Project ABB Rev.01
589 pages
Test Planner Cet
No ratings yet
Test Planner Cet
8 pages
21.sop For Soil Investigation Work
No ratings yet
21.sop For Soil Investigation Work
21 pages
Sarva Siksha Abhiyan (SSA) : Manual For District-Level Functionaries
No ratings yet
Sarva Siksha Abhiyan (SSA) : Manual For District-Level Functionaries
34 pages
Advanced Word Processing Skills: By: Mark Jhon C. Oxillo
No ratings yet
Advanced Word Processing Skills: By: Mark Jhon C. Oxillo
40 pages
Event Management-Company Profile
No ratings yet
Event Management-Company Profile
2 pages
Goodwin Cable Diagnostics
No ratings yet
Goodwin Cable Diagnostics
19 pages
MIL Module 6 LAS
No ratings yet
MIL Module 6 LAS
4 pages
PERTINENT AND CONTEMPORARY ISSUES (PCIs), PEE, VBE and LSPs
No ratings yet
PERTINENT AND CONTEMPORARY ISSUES (PCIs), PEE, VBE and LSPs
29 pages
b.sc.& Bca II Sem Exam_tt_2025
No ratings yet
b.sc.& Bca II Sem Exam_tt_2025
2 pages
English For Welding 2
No ratings yet
English For Welding 2
3 pages
IGCSE Add Maths
No ratings yet
IGCSE Add Maths
14 pages
SimtekESO Flyer
No ratings yet
SimtekESO Flyer
2 pages
Transportation Engineering.. Case Analysis
No ratings yet
Transportation Engineering.. Case Analysis
112 pages

2103.02440v2

Uploaded by

2103.02440v2

Uploaded by

1

Abstract—Many image-based perception tasks can be formu-

lated as detecting, associating and tracking semantic keypoints,

be viable for self-driving cars. Tracking human poses in real-

(a) (b) (c)

components vx , vy is {s, vx , vy }ij . This is equivalent to “over-

AP AP0.50 AP0.75 APeasy APmedium APhard FPS

AP APM APL t [ms]

PoseTrack 2018 MOTA FPS AP

PoseTrack 2017 MOTA FPS AP

This option would not be used in most real-world settings.

AP AP0.50 AP0.75 APM APL t [ms] tdec [ms]

Sven Kreiss is a postdoc at the Visual Intelligence

Lorenzo Bertoni is a doctoral student at the Visual

You might also like