OpenPifPaf:
Composite Fields for Semantic Keypoint Detection
and Spatio-Temporal Association
Sven Kreiss, Lorenzo Bertoni, Alexandre Alahi
solves pose detection and tracking with Composite Fields. First, we review Composite Fields for single-image multi-person pose estimation [11]. Second, we introduce a new method for pose tracking. While single-frame pose estimation can be viewed as a pose completion task starting at a seed joint, we treat pose tracking as a pose completion task starting with a pose from a previous frame and completing a spatio-temporal pose, which is a single, connected graph that spans space and time. The spatio-temporal pose consists of at least two single-frame poses and additional connections across the frames.

The contributions of this paper are (i) a Temporal Composite Association Field (TCAF) which we use to form a spatio-temporal pose and (ii) a greedy decoder to jointly detect and track poses. To the best of our knowledge, this method is the first single-stage, bottom-up pose detection and tracking method. We outperform all previous methods in accuracy and speed on the CrowdPose dataset [12] with its particularly crowded images. We perform on par with the state-of-the-art bottom-up method for single-image human pose estimation on the COCO [2] keypoint task in precision and are an order of magnitude faster in speed. Our model performs on par with the state-of-the-art method for human pose tracking on PoseTrack 2017 and 2018 [13] while simultaneously being an order of magnitude faster during prediction. We also show that our method generalizes to car and animal poses, which demonstrates its suitability for a holistic perception framework. Our method is implemented as an open source library, referred to as OpenPifPaf¹.

¹ https://github.com/vita-epfl/openpifpaf_posetrack

II. RELATED WORK

A. Pose Estimation

State-of-the-art methods for pose estimation are based on Convolutional Neural Networks [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. All approaches for human pose estimation can be grouped into bottom-up and top-down methods. The former estimate each body joint first and then group the joints to form a unique pose. The latter run a person detector first and estimate body joints within the detected bounding boxes. Bottom-up methods were pioneered, e.g., by Pishchulin et al. with DeepCut [25]. In their work, the part association is solved with an integer linear program, leading to processing times on the order of hours for one image. Newer methods use greedy decoders in combination with additional tools to reduce prediction time, as in Part Affinity Fields [16], Associative Embedding [17], PersonLab [18] and multi-resolution networks with associative embedding [24]. PifPaf [11] introduced composite fields for pose estimation that produce a more precise association between joints than OpenPose's Part Affinity Fields [16] and PersonLab's mid-range fields [18]. In the next section, we will review composite fields and show that they generalize to tracking tasks.

B. Pose Tracking

Tracking algorithms can be grouped into top-down versus bottom-up approaches for both the pose part and the tracking part. Doering et al. [10] were the first to introduce a method that is bottom-up in both the spatial and the temporal part. They employ Part Affinity Fields [16] for the single-frame poses in a Siamese architecture. The temporal flow fields (TFF) feed into an edge cost computation for bipartite graph matching for tracking. The idea is extended in MIPAL [26] for tracking limbs instead of joints and in STAF [27].

Early work on multi-person pose tracking started with [8], [9]. Recent work has shown excellent performance on the PoseTrack 2018 dataset, including the top-down method openSVAI [28] which decomposes the problem into three independent stages of human candidate detection, single-image human pose estimation and pose tracking. Similarly, LightTrack [29] also builds a strong top-down pipeline with interchangeable and independent modules. Miracle [30] uses a strong single-image pose estimator with a cascaded pyramid network together with an IOU tracker. HRNet for human pose estimation [20] leverages a multi-resolution backbone to produce high resolution feature maps that are context aware via HRNet's multi-scale fusion. In MSRA/FlowTrack [19], optical flow is used to improve top-down tracking of bounding boxes for tracking of human poses. Pose-Guided Grouping (PGG) [31] proposes a part association method based on separate spatial and temporal embeddings. KeyTrack [32] uses pose tokenization and a transformer network to associate poses.

C. Beyond Humans

While many state-of-the-art methods focus on human body pose detection and tracking, the research community has recently studied their performance on other classes such as animals and cars. Pose estimation research for animals and cars has to deal with additional challenges: limited labeled data [33] and a large number of self-occlusions [34].

For animals, datasets are usually small and include limited animal species [35], [33], [36], [37], [38]. To overcome this issue, DeepLabCut [39] and WS-CDA [33] have developed transfer learning techniques from humans to animals. Mu et al. [40] have generated a synthetic dataset from CAD animal models and proposed a technique to bridge the real-synthetic domain gap. Another line of work has extended the human SMPL model [41] to animals to simultaneously learn pose and shape of endangered animals [42], [43], [44].

For cars, self-occlusions between keypoints are inevitable. A few methods improve performance by estimating 2D and 3D keypoints of vehicles together. Occlusion-net [34] uses a 3D graph network with self-supervision to predict 2D and 3D keypoints of vehicles using the CarFusion dataset [45], while GSNet [46] predicts 6DoF car pose and reconstructs dense 3D shape simultaneously. Without 3D information, the popular OpenPose [47] shows qualitative results for vehicles and Simple Baseline [48] extends a top-down pose estimator for cars on a custom dataset based on Pascal3D+ [49].

III. COMPOSITE FIELDS

Our method relies on the Composite Fields formalism to jointly detect and track semantic keypoints. Hereafter, we briefly present them.
Fig. 3. Visualizing the components of the CAF that associates left shoulder with left hip. This is one of the 18 CAFs. Every location of the feature map is the origin of two vectors which point to the shoulders and hips to associate. The confidence of association a_c is shown at its origin in (3a) and the vector components for a_c greater than 0.5 are shown in (3b).
Fig. 5. Model architecture. The input is an image batch of size (H, W) with three color channels, indicated by "×3". During joint training on multiple datasets, the datasets produce image pairs (black arrows for the current image at t_0 and red arrows for the image at t_{-1}) whereas during evaluation they produce single images in a sequence. The neural-network-based encoder produces composite fields for M joints and N connections. An operation with stride two is indicated by "//2". The shared backbone is a ResNet [50] or ShuffleNetV2 [51] without max-pooling. The Feature Cache is only used during evaluation and injects, for every image, the previous feature map into the batch. We use a single 1×1 convolution in each head network. The TCAF head networks have a shared pre-processing step consisting of a feature reduction to 512 with a 1×1 convolution followed by ReLU, a concatenation of the two feature maps and another 1×1 convolution with ReLU activation. For optional spatial upsampling, we append a sub-pixel convolution layer [52] to each head network. The decoder converts a set of composite fields into pose estimates. Each semantic keypoint is represented by a confidence score, a real-valued (x, y) coordinate pair and a size estimate.
accumulation incorporates information of the confidence c, the precisely regressed spatial location (x, y) and the predicted joint size σ. This map f_J is used to seed the pose decoder and to rescore predicted CAF associations.

c) Composite Association Fields (CAF): Efficiently forming associations is the core challenge for tracking multiple poses in a video sequence. The most difficult cases are crowded scenes and camera angles where people occlude other people – as is the case in the self-driving car perspective where pedestrians occlude other pedestrians. Top-down methods first estimate bounding boxes and then do single-person pose estimation per bounding box. This assumes non-overlapping bounding boxes, which is not given in our scenario. Therefore, we focus on bottom-up methods.

In [11], we introduced Part Association Fields to connect joint locations together into poses. Here, we extend this field with joint-scale components and call it Composite Association Field (CAF) to distinguish it better from the Part Affinity Fields introduced in [16]. A graphical review of association fields is shown in Figure 4 and shows that our CAF expresses the most detail about an association.

CAFs predict a confidence, two vectors to the two parts this association is connecting, two spreads b for the spatial precisions of the regressions (details in Section IV-A) and two joint sizes σ. CAFs are represented with

$$
a^{ij}_{J_1 \leftrightarrow J_2} = \{c, x_1, y_1, x_2, y_2, b_1, b_2, \sigma_1, \sigma_2\}^{ij}_{J_1 \leftrightarrow J_2},
$$

where J_1 ↔ J_2 is the association between body joints J_1 and J_2. Predicted associations between left shoulders and left hips are shown for an example image in Figure 3. In our representation of an association, physically meaningful quantities are regressed to continuous variables and do not suffer from the discreteness of the feature map. In addition, it is important to represent associations between two joints that are at the same pixel location. Our representation is stable for these zero-distance associations – something that Part Affinity Fields [16] cannot do – which becomes particularly important when we introduce our extension for tracking.
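To make this layout concrete, the following is a minimal sketch of how a single CAF feature-map cell could be read out. The channel ordering, the array name `caf`, the helper `decode_caf_cell` and the assumption that offsets are regressed relative to the cell and scaled by the output stride are illustrative choices, not the exact conventions of the OpenPifPaf code base.

```python
import numpy as np

# Hypothetical CAF tensor for one connection type, e.g. left shoulder <-> left hip:
# shape (9, H, W) with channels [c, x1, y1, x2, y2, b1, b2, sigma1, sigma2].
caf = np.random.rand(9, 13, 17).astype(np.float32)

def decode_caf_cell(caf, j, i, stride=8.0):
    """Read one feature-map cell (j, i) and return the association it encodes."""
    c, x1, y1, x2, y2, b1, b2, s1, s2 = caf[:, j, i]
    return {
        'confidence': float(c),
        # regressed offsets assumed relative to the cell, converted to image coordinates
        'joint1_xy': ((i + x1) * stride, (j + y1) * stride),
        'joint2_xy': ((i + x2) * stride, (j + y2) * stride),
        'spreads': (float(b1), float(b2)),          # spatial precision of the regressions
        'joint_sizes': (float(s1) * stride, float(s2) * stride),
    }

# keep only confident associations, as visualized in Figure 3b
associations = [
    decode_caf_cell(caf, j, i)
    for j in range(caf.shape[1]) for i in range(caf.shape[2])
    if caf[0, j, i] > 0.5
]
```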
IV. METHOD

We aim to present a method that can detect, associate and track semantic keypoints in videos efficiently. We place particular emphasis on urban and crowded scenes that are difficult for autonomous vehicles. Many previous methods struggle when object bounding boxes overlap. In bird's-eye views from drones or security cameras, bounding boxes are more separated than in a car driver's perspective. Here, top-down methods struggle. Previous bottom-up methods have been trailing top-down methods in accuracy without improving on speed either. Our bottom-up method is efficient, employs a stable field representation and has high accuracy and speed that even surpass top-down methods.

Figure 5 presents our model architecture. It is a shared ResNet [50] or ShuffleNetV2 [51] base network without max-pooling. The head networks are shallow and not shared between datasets. In our examples, each dataset has a head network for joint intensities (Composite Intensity Field – CIF) and a head network for associations (Composite Association Field – CAF). Beyond CIF and CAF, additional head networks can be added. In Section IV-B, we introduce the new Temporal Composite Association Field (TCAF) which is predicted by an additional head network to facilitate pose tracking.

We will introduce a tracking method that is a direct extension of single-image pose estimation. Therefore, we first introduce our method for single-image pose estimation with particular emphasis on details that will be relevant for pose tracking.
Fig. 6. Effect of self-hidden keypoint suppression during training. The left image is without and the right image is with self-hidden keypoint suppression. The left hips of both soccer players collide in pixel space.

A. Single-Image Pose Estimation

a) Loss Functions for Composite Fields: Human pose estimation algorithms tend to struggle with the diversity of scales that a human pose can have in an image. While a localization error for the joint of a large person can be minor, that same absolute error might be a major mistake for a small person. Our loss is the logarithm of the probability that all components are "well" predicted, i.e., it is the sum of the log-probabilities for the individual components. Each component follows standard loss prescriptions. We use binary cross entropy (BCE) for classification with a Focal loss modification w [54]. To regress locations in the image, we use the Laplace loss [55] which is an L1-type loss that is attenuated by a predicted spread b̂ in the location. To regress additional scale components (keypoint sizes), we use a Laplace loss with a fixed spread b_σ = 3. The CIF loss function is

$$
\begin{aligned}
L_\mathrm{CIF} ={}& \sum_{m_c} w(c, \hat c)\,\mathrm{BCE}(c, \hat c) && (2)\\
&+ \sum_{m_v} \frac{1}{\hat b}\, L_2(v, \hat v, b_\mathrm{min}) + \log \hat b && (3)\\
&+ \sum_{m_\sigma} \frac{1}{b_\sigma} \left| 1 - \frac{\hat s}{s} \right| && (4)
\end{aligned}
$$

with its three parts for confidence (2), localization (3) and scale (4) and where

$$
L_2(v, \hat v, b_\mathrm{min}) = \sqrt{(v_1 - \hat v_1)^2 + (v_2 - \hat v_2)^2 + b_\mathrm{min}^2}. \quad (5)
$$

The sums are over masked feature cells m_c, m_v and m_σ with i, j, J implied. The mask for confidence m_c is almost the entire image apart from regions annotated as "crowd regions" [2]. The masks for localization m_v and for scale m_σ are only active in a 4 × 4 window around the ground truth keypoint. Per feature map cell, there is a ground truth confidence c and its predicted counterpart ĉ. The predicted location v̂ = (v̂_1, v̂_2) is optimized with a Laplace loss with a predicted spread b̂ for heteroscedastic aleatoric uncertainty [55] with respect to the ground truth location v. A b_min = 1 px is added to prevent exploding losses when the spread becomes too small. For stability, we clip the BCE loss when it becomes larger than five. The CAF loss has the same structure but with two localization components (3) and two scale components (4).
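As a concrete illustration of these composite loss terms, here is a minimal PyTorch-style sketch. The dense tensor layout, the boolean masks and the focal-style weight `w` are simplifying assumptions for illustration rather than the exact OpenPifPaf implementation.

```python
import torch
import torch.nn.functional as F

def cif_loss(c_hat, c, v_hat, v, b_hat, s_hat, s,
             m_c, m_v, m_sigma, b_min=1.0, b_sigma=3.0, gamma=2.0):
    """Sketch of the CIF loss: confidence (2), localization (3), scale (4)."""
    # (2) focal-weighted BCE on the confidence map, clipped at five for stability
    bce = F.binary_cross_entropy(c_hat, c, reduction='none').clamp(max=5.0)
    w = (c_hat - c).abs() ** gamma          # assumed focal-style modification
    loss_conf = (w * bce)[m_c].sum()

    # (3) Laplace loss on the (x, y) regression with predicted spread b_hat
    l2 = torch.sqrt(((v - v_hat) ** 2).sum(dim=-1) + b_min ** 2)
    loss_loc = (l2 / b_hat + torch.log(b_hat))[m_v].sum()

    # (4) Laplace loss with fixed spread b_sigma on the keypoint size
    loss_scale = ((1.0 - s_hat / s).abs() / b_sigma)[m_sigma].sum()

    return loss_conf + loss_loc + loss_scale
```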
b) Self-Hidden Keypoint Suppression: The COCO evaluation metric treats visible and hidden keypoints in the same manner. As in [11], we include hidden keypoints in our training. However, when a visible and a hidden keypoint appear close together, we remove the hidden keypoint from the ground truth annotation so that this keypoint is not included in associations. In Figure 6, we show the effect of excluding these self-hidden keypoints from training and observe better pose reconstruction when a keypoint hides another keypoint of the same type.

c) Greedy Decoder with Frontier: The composite fields are converted into sets of pose estimates with the greedy decoder introduced in [11] and reviewed here. The CIF field and its high-resolution accumulation f(x, y) defined in equation 1 provide seed locations. Previously, new associations were formed starting at the joint that currently has the highest score, without taking the CAF confidence of the association into account. Here, we introduce a frontier, which is a priority queue of possible next associations. The frontier is ordered by the possible future joint scores, which are a function of the previous joint score and the best CAF association:

$$
\max_{ij}\; s(a^{ij}_{J_1 \leftrightarrow J_2}, \vec x) = c \,\exp\!\left(-\frac{\lVert \vec x - (x_1, y_1)\rVert_2}{\sigma_1}\right) f_{J_2}(x_2, y_2) \quad (6)
$$

where $\vec x$ is the source joint location, $a^{ij}_{J_1 \leftrightarrow J_2} = (c, x_1, y_1, x_2, y_2, \sigma_1, \sigma_2)$ is the CAF field with implied sub-/superscripts on the components and $f_{J_2}$ is the high-resolution confidence map of the target joint J_2. An association is rejected when it fails reverse matching. To reduce jitter, we not only use the best CAF association in the above equation but a weighted mixture of the best two associations, similar to blended connections in [56]. Only once all possible associations have been added to the frontier is the connection made to the highest-priority entry in the frontier. This algorithm is fast and greedy. Once a connection to a new joint has been made, this decision is final.
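The frontier logic can be sketched with a standard priority queue. The helper names (`caf_score`, `grow_pose`), the dictionary-based data structures and the choice to multiply the source joint score into the frontier priority are illustrative assumptions, not the actual decoder API.

```python
import heapq
import itertools
import math

def caf_score(assoc, src_xy, f_target):
    """Equation (6): CAF confidence, a distance penalty on the source side,
    and the high-resolution confidence of the target joint."""
    dist = math.hypot(src_xy[0] - assoc['joint1_xy'][0],
                      src_xy[1] - assoc['joint1_xy'][1])
    return assoc['confidence'] * math.exp(-dist / assoc['joint_sizes'][0]) \
        * f_target(*assoc['joint2_xy'])

def grow_pose(pose, caf_candidates, f_high_res):
    """Frontier-based greedy expansion of one partially decoded pose (sketch).

    pose: dict joint_name -> (x, y, score)
    caf_candidates: dict (src_joint, dst_joint) -> list of decoded associations
    f_high_res: dict joint_name -> callable (x, y) -> confidence
    """
    counter = itertools.count()   # tie-breaker so the heap never compares dicts
    frontier = []
    for src, (x, y, score) in pose.items():
        for (a, b), assocs in caf_candidates.items():
            if a != src or b in pose or not assocs:
                continue
            best = max(assocs, key=lambda m: caf_score(m, (x, y), f_high_res[b]))
            s = score * caf_score(best, (x, y), f_high_res[b])
            heapq.heappush(frontier, (-s, next(counter), b, best))

    while frontier:
        neg_s, _, joint, assoc = heapq.heappop(frontier)
        if joint in pose:      # a higher-priority association already claimed it
            continue
        x2, y2 = assoc['joint2_xy']
        pose[joint] = (x2, y2, -neg_s)
        # a full decoder would now also push the new joint's outgoing associations
    return pose
```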
d) Instance Score and Non-Maximum Suppression (NMS): Once all poses are reconstructed, we apply NMS. Poses are first sorted by their instance score, which is the weighted mean of the keypoint scores where the three highest keypoint scores are weighted three times higher. We run NMS at the keypoint level as in [11], [18]. The suppression radius is dynamic and based on the predicted joint size. We do not refine predictions.

e) Denser Pose Skeletons: Figure 7 gives an overview of the pose skeletons that are used in this paper. In particular, Figure 7b shows a modification of the standard COCO pose [2] with additional associations. These denser associations are redundancies in case of occlusions. The additional associations are longer-range and therefore harder to predict. The frontier in our greedy decoder takes this difficulty into account and automatically prefers easier, confident associations when available. Qualitatively, the advantage of dense associations is shown in Figure 8. With the standard COCO skeleton, the single person's pose skeleton would be divided into two disconnected parts (left image) as indicated by the two white bounding
boxes. With the additional denser associations, a single pose is formed (right image).

Fig. 7. A COCO person pose [2] is shown in (a). Additional denser connections are shown in lighter colors in (b). The additional connections provide redundancies in case of occlusions. A pose skeleton as used in PoseTrack with temporal connections is shown in (c). An example of a tracked pose is shown in (d). The first frame is captured with the right leg (blue) in front and the second frame one step later. For clarity, only connections that were used to decode the pose are shown and therefore only the temporal connection that is connecting the right ankle from the past frame to the current frame is visible.

Fig. 8. Left: A sparse pose cannot connect the right arm to the facial keypoints, leading to the detection of two separate person instances highlighted by the two white bounding boxes. Right: An additional dense connection between the nose and right shoulder leads to a correctly identified single pose.

B. Pose Tracking

In the previous section we introduced our method for bottom-up pose estimation in single images. We now generalize that method to tracking poses in videos with associations between images in the same bottom-up fashion. Our unified approach forms both spatial and temporal associations simultaneously. This even leads to improved single-image poses from the additional temporal information.

a) Temporal Composite Association Field (TCAF): During training, tracking data is fed into the base network as image pairs that are concatenated in the batch dimension, i.e., a batched input tensor of eight image pairs has the same shape as 16 individual images.

During prediction, the backbone processes one image at a time and each image only once. The resulting feature map is then concatenated with the previous feature map from the "Feature Cache" (see Figure 5). While there is still duplicate computation in the head networks, their computational complexity is small.

To form associations in image sequences, we introduce the Temporal Composite Association Field (TCAF). Its output structure is identical to a CAF field, but its input is based on pairs of feature maps that were created independently. To jointly process information from both feature maps, the TCAF head contains a preprocessing step of a 1×1 input convolution to reduce the feature size to 512 with a ReLU non-linearity, a concatenation of these two feature maps to 1024 features, a 1×1 convolution with ReLU to process the two images jointly and a final 1×1 convolution to produce all components necessary for a composite association field.
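The TCAF head preprocessing described above can be sketched as a small PyTorch module. The module name, the 2048-feature backbone output, the 1024-channel width of the fusion convolution and the 9 output components per temporal connection are illustrative assumptions based on the descriptions above, not the exact implementation.

```python
import torch
import torch.nn as nn

class TCAFHead(nn.Module):
    """Sketch of the TCAF head: fuse two independently computed feature maps."""

    def __init__(self, in_features=2048, n_connections=17, fields_per_connection=9):
        super().__init__()
        self.reduce = nn.Conv2d(in_features, 512, kernel_size=1)   # shared 1x1 reduction
        self.fuse = nn.Conv2d(1024, 1024, kernel_size=1)           # joint 1x1 processing
        self.out = nn.Conv2d(1024, n_connections * fields_per_connection, kernel_size=1)

    def forward(self, feat_t0, feat_t1):
        # reduce both feature maps to 512 channels, then concatenate to 1024
        a = torch.relu(self.reduce(feat_t0))
        b = torch.relu(self.reduce(feat_t1))
        x = torch.relu(self.fuse(torch.cat([a, b], dim=1)))
        return self.out(x)

# usage sketch: feature maps of the current and the previous frame
f_t0 = torch.randn(1, 2048, 101, 101)
f_t1 = torch.randn(1, 2048, 101, 101)
tcaf = TCAFHead()(f_t0, f_t1)   # shape (1, 17 * 9, 101, 101)
```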
b) Spatio-Temporal Poses: Figure 7c shows a schematic of a person pose (17 joints and 18 associations) with additional temporal connections to all joints of the same kind in the previous frame. In our method, this is treated as a single pose with 2 × 17 joints (CIF) and 18 associations (CAF) within the same frame and an additional 17 associations (TCAF) between frames.

c) From Spatio-Temporal Poses to Tracks: Spatio-temporal poses create temporal associations in pairs of images. We now introduce our book-keeping method to go from pairs of images to image sequences. During evaluation and for a new frame t_0, the decoder creates new tracking poses from existing tracks (poses in the previous frame t_{-1}) or from single-image seeds in the current frame t_0. These partial poses are then completed using the same greedy frontier decoder described for single images. Once all spatio-temporal poses are complete, the t_0 joints are extracted into single-frame poses. Every single-frame pose is already tagged with an existing track-id if the spatio-temporal pose was generated from an existing track, or with a new track-id if the spatio-temporal pose originated from a new seed in the current frame. The single-frame poses are then filtered with soft NMS [18] and then either added to existing tracks or they become the first poses of new tracks.
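A minimal sketch of this book-keeping step, assuming the spatio-temporal poses have already been decoded and soft-NMS-filtered; the data layout and the helper `update_tracks` are hypothetical.

```python
import itertools

_new_track_id = itertools.count()

def update_tracks(tracks, spatio_temporal_poses):
    """Book-keeping sketch: extract frame-t0 poses and extend or start tracks.

    tracks: dict track_id -> list of single-frame poses (most recent last)
    spatio_temporal_poses: list of dicts with keys
        'track_id' (int, or None if the pose was seeded in the current frame) and
        'pose_t0'  (the decoded joints of the current frame)
    """
    for stp in spatio_temporal_poses:
        track_id = stp['track_id']
        if track_id is None:                 # pose grown from a new single-image seed
            track_id = next(_new_track_id)
            tracks[track_id] = []
        tracks[track_id].append(stp['pose_t0'])
    return tracks

# usage sketch
tracks = {}
frame_poses = [{'track_id': None, 'pose_t0': {'nose': (120.0, 80.0, 0.9)}}]
tracks = update_tracks(tracks, frame_poses)
```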
Our method is bottom-up in both pose estimation and tracking and estimates temporal and spatial connections within a single stage. Most existing work – even other bottom-up tracking methods [10], [26] – employs a two-stage process where, first, spatial connections are estimated and, second, temporal connections are made.

V. EXPERIMENTS

Self-driving cars must perceive and predict pedestrians and other traffic participants robustly. Crowded places are among the most challenging scenarios. We will first show experiments on single-image human pose estimation in CrowdPose [12], which contains particularly challenging scenarios, and on the standardized and competitive COCO [2] person keypoint benchmark. Then we will show results for pose tracking in videos on the PoseTrack 2017 [9] and 2018 [13] datasets. We have conducted extensive experiments to show the benefit of a unified bottom-up pose estimation and tracking method with spatio-temporal poses. To demonstrate the universality of our approach, we also apply our method to poses of cars and poses of animals.

A. Datasets

a) CrowdPose: In [12], the CrowdPose dataset is proposed. It is a selection of images from other datasets with a particular emphasis on how crowded the images are. The crowd-index of an image represents the amount of overlap between person bounding boxes. The authors place particular
emphasis on a uniform distribution of the crowd-index in all data partitions. Because this dataset is a composition of other datasets and to avoid contamination, our CrowdPose models are pretrained on ImageNet [57] and then trained on CrowdPose only. The dataset comes with a split of 10,000 images for training, 2,000 for validation and 8,000 images for the test set.

b) COCO: The de-facto standard for person keypoint prediction is the competitive COCO keypoint task [2]. The test set is private and powers an active leaderboard via a protected challenge server. COCO contains 56,599 diverse training images with person keypoint annotations. The validation and test-dev sets contain 5,000 and 20,288 images.

c) ApolloCar3D: We generalize our approach to vehicle keypoints using the ApolloCar3D dataset [58], which contains 5,277 driving images at a resolution of 4K and over 60K car instances. The authors defined 66 semantic keypoints in the dataset and, for each car, they provided annotations for the visible ones. For clarity, we choose a subset of 24 semantic keypoints and show quantitative and qualitative results on this dataset.

d) Animal Dataset: We evaluate the performance of our algorithm on the Animal-Pose Dataset [33], which provides annotations for five categories of animals: dog, cat, cow, horse and sheep, for a total of 20 keypoints. The dataset includes 5,517 instances in more than 3,000 images. The majority of these images originally belong to the VOC dataset [59].

e) PoseTrack 2017 and 2018: We conduct quantitative studies of our tracking performance on the PoseTrack 2017 [9] and 2018 [13] datasets. The datasets contain short video sequences of annotated and tracked human poses in diverse situations. The PoseTrack 2018 dataset contains 593 training scenes, 170 validation scenes and 375 test scenes. The test labels are private. PoseTrack 2017 is a subset of the 2018 dataset with 292 train, 50 validation and 208 test scenes. However, the 2018 leaderboard is frozen and new results are only updated for the 2017 leaderboard. Therefore, many recent methods present results on the older, smaller dataset. Here, we will report numbers for both 2017 and 2018.

B. Evaluation

a) Single-Image Multi-Person Poses: Both CrowdPose and COCO follow COCO's keypoint evaluation method. The object keypoint similarity (OKS) score [2] is used to assign a bounding box to each keypoint as a function of the person instance bounding box area. Similar to detection, the metric computes overlaps between ground truth and predicted bounding boxes to compute the standard detection metrics average precision (AP) and average recall (AR).

CrowdPose breaks down the test set at the image level into easy, medium and hard. The easy set contains images with a crowd index in [0, 0.1], the medium set in [0.1, 0.8] and the hard set in [0.8, 1.0]. Given the uniform crowd-index distribution, most images of the test set are in the medium category.

COCO breaks down the precision scores at the instance level for medium instances with a bounding box area of (32 px)² to (96 px)² and for large instances with a bounding box area larger than (96 px)². For each image, pose estimators have to provide the 17 keypoint locations per pose and a total score for each pose. Only the top 20 scoring poses per image are considered for evaluation.

b) Pose Tracks: A common metric to evaluate the tracking of human poses is the Multiple Object Tracking Accuracy (MOTA) [60], [4], which is also the main metric in the PoseTrack challenges and leaderboards. It combines false positives, false negatives and ID switches into a single metric. We compare against the best methods that submitted to the PoseTrack 2017 and 2018 evaluation server, which computes all metrics on private test sets. These methods include strong top-down methods as well as bottom-up methods for pose estimation and tracking.
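For reference, MOTA can be computed from accumulated per-sequence counts as in the CLEAR MOT formulation [60]; the small helper below is a generic sketch, not the PoseTrack evaluation code.

```python
def mota(false_positives, misses, id_switches, num_ground_truth):
    """CLEAR MOT accuracy: 1 - (FP + FN + ID switches) / ground-truth objects.

    All arguments are totals accumulated over every frame of a sequence.
    """
    if num_ground_truth == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - (false_positives + misses + id_switches) / num_ground_truth

# example: 120 FP, 300 misses, 12 ID switches over 5000 annotated poses
print(mota(120, 300, 12, 5000))  # -> 0.9136
```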
C. Implementation Details

a) Neural Network Configuration: All our models are based on ResNet [50] or ShuffleNetV2 [51] base networks and multiple head networks. The base networks have their input max-pooling operation removed as it destroys spatial information. The stride from input image to output feature map is 16 with 2048 features at each location. We apply no additional modifications to the standard ResNet models. We use the standard building blocks of ShuffleNetV2 backbones to construct our custom configurations, which we denote ShuffleNetV2K16/K30. A ShuffleNetV2K16 model has the prediction accuracy of a ResNet50 with fewer parameters than a ResNet18. The configuration is specified by the number of output features of the five stages and the number of repetitions of the blocks in each stage. Our ShuffleNetV2K16 has output features (block repeats) of 24 (1), 348 (4), 696 (8), 1392 (4), 1392 (1) and our ShuffleNetV2K30 has 32 (1), 512 (8), 1024 (16), 2048 (6), 2048 (1). Spatial 3×3 convolutions are replaced with 5×5 convolutions, which introduces only a small increase in the number of parameters because all spatial convolutions are depth-wise.

Each head network is a single 1×1 convolution followed by a sub-pixel convolution [52] to double the spatial resolution, bringing the total stride down to eight. Therefore, the spatial feature map size for an input image of 801 px × 801 px is 101 × 101. The confidence component of a field is normalized with a sigmoid non-linearity and the scale components for joint sizes are enforced to be positive with a softplus [61].
object keypoint similarity (OKS) score [2] is used to assign are enforced to be positive with a softplus [61].
a bounding box to each keypoint as a function of the person b) Augmentations: We apply the standard augmentations
instance bounding box area. Similar to detection, the metric of random horizontal flipping, random rescaling with a rescal-
computes overlaps between ground truth and predicted bound- ing factor r ∈ [0.5, 2.0], random cropping and padding to
ing boxes to compute the standard detection metrics average 385 × 385 followed by color jittering with 40% variation in
precision (AP) and average recall (AR). brightness and saturation and 10% variation in hue. We also
CrowdPose breaks down the test set at the image level convert a random 1% of the images to grayscale and generate
into easy, medium and hard. The easy set contains images strong JPEG compression artifacts in 10% of the images.
with a crowd index in [0, 0.1], the medium set in [0.1, 0.8] The tracking task is similarly augmented. The random
and the hard set in [0.8, 1.0]. Given the uniform crowd-index rescaling is adapted to an image width in [0.5×801, 1.5×801]
distribution, most images of the test set are in the medium and random cropping to a maximum image side of 385 px.
category. Half of the image pairs are randomly reoriented (rotations by
COCO breaks down the precision scores at the instance level multiples of 90◦ ). To increase the inter-frame variations, we
for medium instances with a bounding box area of (32 px)2 add a small synthetic camera shift of maximum 30 px between
image pairs. To further increase the variation, we form image pairs with a random interval of 4, 8 and 12 frames. In 20% of image pairs, we replace one of the images with a random image to provide a higher number of negative samples for tracking.

Fig. 9. Illustration of OpenPifPaf predictions from the CrowdPose [12] val set with crowd-index hard on a sports scene, a family photo and a street scene.

c) Single-Image Training: For ResNet [50] backbones, we use ImageNet [57] pretrained models. ShuffleNetV2 [51] models are trained from random initializations. We use the SGD [62] optimizer with Nesterov momentum [63] of 0.95, a batch size of 32 and a weight decay of 10⁻⁵. The learning rate is exponentially warmed up for one epoch from 10⁻³ of its target value. At certain epochs (specified below), the learning rate is exponentially decayed over 10 epochs by a factor of 10. We employ model averaging [64], [65] to extract stable models for validation. At each optimization step, we update an exponentially weighted version of the model parameters with a decay constant of 10⁻².

On CrowdPose, which is a smaller dataset than COCO, we train for 300 epochs. We set the target learning rate to 10⁻⁵ and decay at epochs 250 and 280.

On COCO, we use a target learning rate of 10⁻⁴ and decay at epochs 130 and 140. The training time for 150 epochs of a ShuffleNetV2K16 on two V100 is approximately 37 hours. We do not use any additional datasets beyond the COCO keypoint annotations.
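These training details translate into a short PyTorch sketch of the optimizer, schedule and parameter averaging; the warm-up/decay helper `lr_factor` and the tiny stand-in model are illustrative simplifications, not the exact training code.

```python
import copy
import torch

def make_optimizer(model):
    # SGD with Nesterov momentum 0.95 and weight decay 1e-5, as described above
    return torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.95,
                           nesterov=True, weight_decay=1e-5)

def lr_factor(epoch, warmup_epochs=1, milestones=(130, 140), decay_epochs=10):
    """Exponential warm-up from 1e-3 of the target LR; exponential decay by 10x
    over 10 epochs at each milestone (COCO schedule assumed)."""
    if epoch < warmup_epochs:
        return 1e-3 ** (1 - epoch / warmup_epochs)        # 1e-3 -> 1.0
    factor = 1.0
    for m in milestones:
        if epoch >= m:
            factor *= 0.1 ** min(1.0, (epoch - m) / decay_epochs)
    return factor

def update_ema(ema_model, model, decay_constant=1e-2):
    """Exponentially weighted copy of the parameters, updated every optimizer step."""
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(1.0 - decay_constant).add_(p, alpha=decay_constant)

model = torch.nn.Linear(10, 2)            # stand-in for the pose network
ema_model = copy.deepcopy(model)
optimizer = make_optimizer(model)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
```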
d) Training for Tracking on PoseTrack: We use the ShuffleNetV2K30 backbone for all our tracking experiments. PoseTrack 2018 is a video dataset, which means that despite a large number of annotations, the variation is smaller than in single-image pose datasets. Therefore, we keep single-image pose estimation on the COCO dataset [2] as an auxiliary task and train on PoseTrack and COCO simultaneously. The types of poses that are annotated in the two datasets are similar but not identical, e.g., one dataset annotates the eyes and the other does not. During training, we alternate the two tasks between batches. In one batch we feed pairs of images from the PoseTrack dataset and apply losses to the corresponding head networks, and in the next batch we feed in single images from COCO and apply losses to the other head networks (see Figure 5). The COCO task is trained identically to the single-image pose estimation discussed in the previous section, but converted from single images to pairs of tracked images via synthetic shifts of up to 30 px. Starting from a trained single-image pose backbone, we train on both datasets with SGD [62] with the same configuration as for single images. We alternate the dataset every batch and only do an SGD step every two batches. We train for 50 epochs where every epoch consists of 4994 batches. The training time is 55 minutes per epoch on two V100 GPUs.

D. Results

a) Crowded Single-Image Pose Estimation: In Figure 9, we show example pose predictions from the CrowdPose [12] validation set. We show results in a diverse selection of sports disciplines and everyday settings. All shown images are from the hard subset with a crowd-index larger than 0.8.

In Table I, we show a quantitative comparison of our performance with other methods. We are not only more precise across all precision metrics AP, AP^0.50, AP^0.75, AP^easy, AP^medium and AP^hard but also predict faster than all previous top-performing methods at 13.7 FPS (frames per second) on a single GTX1080Ti.

b) COCO: All state-of-the-art methods compare their performance on the well-established COCO keypoint task [2]. Our quantitative results on the private 2017 test-dev set are shown in Table II along with other bottom-up methods. This comparison includes field-based methods [16], [18], [11] and methods based on associative embedding [17], [24]. We perform on par with the best existing bottom-up method. We evaluate on rescaled images where the longer edge is 801 px, which is the same image size that will be used for tracking below. We evaluate a single forward pass without horizontal flipping and without multi-scale evaluation because we aim for a fast method. The average time per image with a GTX1080Ti is 152 ms (63 ms on a V100), of which 29 ms is used for decoding.

c) Pose Tracking: We want to track multiple human poses in videos. We train and validate on the PoseTrack 2018 dataset [13]. Table III shows our main results for pose tracking on both of the private test sets of PoseTrack 2017 and 2018. We also show our single-image average precision (AP), which highlights that our performant tracking method can compensate for a lower AP, e.g., compared to MSRA/FlowTrack [19], and still outperform in overall MOTA and FPS. All our results
TABLE I. Evaluation on the CrowdPose test dataset [12]. Our OpenPifPaf result is based on a ResNet50 backbone with single-scale evaluation at 641 px. *Values extracted from the CrowdPose paper [12]. +Employs multi-scale testing.
TABLE II. Evaluation metrics for the COCO 2017 test-dev dataset for bottom-up methods. Numbers are extracted from the respective papers. Our prediction time is determined on a single V100 GPU. *Only evaluating images with three person instances.
TABLE III. Evaluation metrics on the test sets of (a) PoseTrack 2018 [13] and (b) PoseTrack 2017 [9]. Numbers are extracted from the respective papers and the leaderboard. All methods are online methods apart from DetTrack [67].
Fig. 13. Qualitative results from the KITTI [68] and ApolloCar3D [58] datasets. We resolve distant pedestrians, cyclists and cars and handle changing
lighting conditions well.
TABLE V. Ablation studies of skeleton choice and decoder configurations for single-image pose estimation. All results (except where explicitly stated otherwise) are produced with the same ShuffleNetV2K16 model on the COCO val set [2] on a single GTX1080Ti. First, we review different backbone architectures (a ResNet50 [50] and a larger ShuffleNetV2 [51]). Second, we show that only using confident keypoints leads to a large drop in precision. Third, we observe that the Frontier decoder is more important for denser skeletons while incurring almost no overhead on sparse skeletons. Fourth, we can produce a memory-efficient version of our decoder at a cost of 1.4% in AP. The biggest drop in accuracy comes from not rescoring the CAF field and the largest contributor to increasing the inference time is not rescoring the seeds.
Fig. 14. Qualitative results from the Animal-Pose dataset [33]. The left image was processed by a person model and an animal model.

TABLE VI. Baselines and ablation studies on the PoseTrack 2018 validation set [13] on a single V100 GPU. We outperform Hungarian trackers with euclidean and OKS distance functions in accuracy for a small overhead in FPS. We also study our sensitivity to the input image size. For image sizes of 513 px, we observe a drop of 2.9 in MOTA but run 82% faster at 22.2 FPS.

                              MOTA    FPS
    original (801 px)         66.4    12.2
    Hungarian   euclidean     -1.5    +4%
                OKS           -2.0    +1%
    Image size  513 px        -2.9    +82%
                641 px        -0.9    +37%
                1201 px       -1.7    -49%

f) Tracking Ablation: We studied the effect of input image size at the bottom of Table VI. Our "original" model rescales the image width to 801 px. Larger images do not show an improvement in accuracy (MOTA) while becoming significantly slower. Smaller input images decrease MOTA but at the same time can drastically increase speed. Most applications can probably tolerate an accuracy reduction of 0.9 in MOTA to improve speed by +37%. When the input image size is reduced to 513 px, MOTA drops by 2.9 (still a great result), which comes with a speed improvement of +82% to a fast 22.2 FPS.

VI. CONCLUSIONS

We have demonstrated a new method for bottom-up pose tracking for 2D human poses and shown its strength in crowded and occluded scenes that are relevant for perception in self-driving cars and social robots. We outperform previous state-of-the-art methods on CrowdPose and on PoseTrack 2018. On PoseTrack 2017 we are on par with the state-of-the-art but run an order of magnitude faster. We have also shown that our method generalizes to pose estimation of cars and animals. We can run all versions simultaneously on an image sequence and form the union of the predictions. In the future, we can investigate shared backbone architectures to create a holistic perception framework for autonomous vehicles.

VII. ACKNOWLEDGEMENTS

This work was supported by the Swiss National Science Foundation under the Grant 200021-L92326 and the SNSF Spark fund (190677). We also thank our lab members and reviewers for their valuable comments.

REFERENCES

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2d human pose estimation: New benchmark and state of the art analysis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755.
[3] A. Crow, "How safe are self-driving cars?" Rocky Mountain Institute, 5 2017. [Online]. Available: https://rmi.org/safe-self-driving-cars/
[4] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, "Mot16: A benchmark for multi-object tracking," arXiv preprint arXiv:1603.00831, 2016.
[5] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder, "The visual object tracking vot2015 challenge results," in Proceedings of the IEEE international conference on computer vision workshops, 2015, pp. 1–23.
[6] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Cehovin Zajc, T. Vojir, G. Hager, A. Lukezic, A. Eldesokey et al., "The visual object tracking vot2017 challenge results," in Proceedings of the IEEE international conference on computer vision workshops, 2017, pp. 1949–1972.
[7] B. D. Lucas, T. Kanade et al., "An iterative image registration technique with an application to stereo vision," 1981.
[8] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele, "Arttrack: Articulated multi-person tracking in the wild," in Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6457–6465.
[9] U. Iqbal, A. Milan, and J. Gall, "Posetrack: Joint multi-person pose estimation and tracking," in Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2011–2020.
[10] A. Doering, U. Iqbal, and J. Gall, "Joint flow: Temporal flow fields for multi person tracking," arXiv preprint arXiv:1805.04596, 2018.
[11] S. Kreiss, L. Bertoni, and A. Alahi, "Pifpaf: Composite fields for human pose estimation," in Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[12] J. Li, C. Wang, H. Zhu, Y. Mao, H.-S. Fang, and C. Lu, "Crowdpose: Efficient crowded scenes pose estimation and a new benchmark," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10863–10872.
[13] M. Andriluka, U. Iqbal, E. Insafutdinov, L. Pishchulin, A. Milan, J. Gall, and B. Schiele, "Posetrack: A benchmark for human pose estimation and tracking," in Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5167–5176.
[14] A. Toshev and C. Szegedy, "Deeppose: Human pose estimation via deep neural networks," in Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1653–1660.
[15] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask r-cnn," in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 2980–2988.
[16] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2d pose estimation using part affinity fields," in Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7291–7299.
[17] A. Newell, Z. Huang, and J. Deng, "Associative embedding: End-to-end learning for joint detection and grouping," in Advances in Neural Information Processing Systems, 2017, pp. 2277–2287.
[18] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy, "Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model," in European Conference on Computer Vision (ECCV), 2018, pp. 269–286.
[19] B. Xiao, H. Wu, and Y. Wei, "Simple baselines for human pose estimation and tracking," in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 466–481.
[20] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5693–5703.
[21] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4724–4732.
[22] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 483–499.
[23] M. Kocabas, S. Karagoz, and E. Akbas, "Multiposenet: Fast multi-person pose estimation using pose residual network," in European Conference on Computer Vision (ECCV), 2018, pp. 417–433.
[24] B. Cheng, B. Xiao, J. Wang, H. Shi, T. S. Huang, and L. Zhang, "Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5386–5395.
[25] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele, "Deepcut: Joint subset partition and labeling for multi person pose estimation," in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4929–4937.
[26] J. Hwang, J. Lee, S. Park, and N. Kwak, "Pose estimator and tracker using temporal flow maps for limbs," in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8.
[27] Y. Raaj, H. Idrees, G. Hidalgo, and Y. Sheikh, "Efficient online multi-person 2d pose tracking with recurrent spatio-temporal affinity fields," in Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4620–4628.
[28] G. Ning, P. Liu, X. Fan, and C. Zhang, "A top-down approach to articulated human pose estimation and tracking," in European Conference on Computer Vision (ECCV), 2018.
[29] G. Ning, J. Pei, and H. Huang, "Lighttrack: A generic framework for online top-down human pose tracking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 1034–1035.
[30] D. Yu, K. Su, J. Sun, and C. Wang, "Multi-person pose estimation for pose tracking with enhanced cascaded pyramid network," in European Conference on Computer Vision (ECCV), 2018.
[31] S. Jin, W. Liu, W. Ouyang, and C. Qian, "Multi-person articulated tracking with spatial and temporal embeddings," in Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5664–5673.
[32] M. Snower, A. Kadav, F. Lai, and H. P. Graf, "15 keypoints is all you need," arXiv preprint arXiv:1912.02323, 2019.
[33] J. Cao, H. Tang, H.-S. Fang, X. Shen, C. Lu, and Y.-W. Tai, "Cross-domain adaptation for animal pose estimation," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 9498–9507.
[34] N. D. Reddy, M. Vo, and S. G. Narasimhan, "Occlusion-net: 2d/3d occluded keypoint localization using graph networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7326–7335.
[35] T. D. Pereira, D. E. Aldarondo, L. Willmore, M. Kislin, S. S.-H. Wang, M. Murthy, and J. W. Shaevitz, "Fast animal pose estimation using deep neural networks," Nature methods, vol. 16, no. 1, pp. 117–125, 2019.
[36] B. Biggs, O. Boyne, J. Charles, A. Fitzgibbon, and R. Cipolla, "Who left the dogs out? 3d animal reconstruction with expectation maximization in the loop," in European Conference on Computer Vision (ECCV). Springer, 2020, pp. 195–211.
[37] S. Li, J. Li, H. Tang, R. Qian, and W. Lin, "Atrw: A benchmark for amur tiger re-identification in the wild," in Proceedings of the 28th ACM International Conference on Multimedia. New York, NY, USA: Association for Computing Machinery, 2020, pp. 2590–2598.
[38] A. Mathis, M. Yüksekgönül, B. Rogers, M. Bethge, and M. W. Mathis, "Pretraining boosts out-of-domain robustness for pose estimation," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2021.
[39] A. Mathis, P. Mamidanna, K. M. Cury, T. Abe, V. N. Murthy, M. W. Mathis, and M. Bethge, "Deeplabcut: markerless pose estimation of user-defined body parts with deep learning," Nature Publishing Group, Tech. Rep., 2018.
[40] J. Mu, W. Qiu, G. D. Hager, and A. L. Yuille, "Learning from synthetic animals," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 12386–12395.
[41] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, "Smpl: A skinned multi-person linear model," ACM transactions on graphics (TOG), vol. 34, no. 6, pp. 1–16, 2015.
[42] S. Zuffi, A. Kanazawa, and M. J. Black, "Lions and tigers and bears: Capturing non-rigid, 3d, articulated shape from images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 3955–3963.
[43] B. Biggs, T. Roddick, A. Fitzgibbon, and R. Cipolla, "Creatures great and smal: Recovering the shape and motion of animals from video," in Asian Conference on Computer Vision (ACCV). Springer, 2018, pp. 3–19.
[44] S. Zuffi, A. Kanazawa, T. Berger-Wolf, and M. Black, "Three-d safari: Learning to estimate zebra pose, shape, and texture from images 'in the wild'," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 5358–5367.
[45] N. Dinesh Reddy, M. Vo, and S. G. Narasimhan, "Carfusion: Combining point tracking and part detection for dynamic 3d reconstruction of vehicles," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1906–1915.
[46] L. Ke, S. Li, Y. Sun, Y.-W. Tai, and C.-K. Tang, "Gsnet: Joint vehicle pose and shape reconstruction with geometrical and scene-aware supervision," in European Conference on Computer Vision (ECCV). Springer, 2020, pp. 515–532.
[47] Z. Cao, G. H. Martinez, T. Simon, S.-E. Wei, and Y. A. Sheikh, "Openpose: realtime multi-person 2d pose estimation using part affinity fields," IEEE transactions on pattern analysis and machine intelligence, 2019.
[48] H. C. Sánchez, A. H. Martínez, R. I. Gonzalo, N. H. Parra, I. P. Alonso, and D. Fernandez-Llorca, "Simple baseline for vehicle pose estimation: Experimental validation," IEEE Access, vol. 8, pp. 132539–132550, 2020.
[49] Y. Xiang, R. Mottaghi, and S. Savarese, "Beyond pascal: A benchmark for 3d object detection in the wild," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2014, pp. 75–82.
[50] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[51] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, "Shufflenet v2: Practical guidelines for efficient cnn architecture design," in European Conference on Computer Vision (ECCV), 2018, pp. 116–131.
[52] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1874–1883.
[53] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy, "Towards accurate multi-person pose estimation in the wild," in Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, no. 4, 2017, p. 6.
[54] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE international conference on computer vision (ICCV), 2017, pp. 2980–2988.
[55] A. Kendall and Y. Gal, "What uncertainties do we need in bayesian deep learning for computer vision?" in Advances in neural information processing systems, 2017, pp. 5574–5584.
[56] V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveendran, and M. Grundmann, "Blazeface: Sub-millisecond neural face detection on mobile gpus," arXiv preprint arXiv:1907.05047, 2019.
[57] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
[58] X. Song, P. Wang, D. Zhou, R. Zhu, C. Guan, Y. Dai, H. Su, H. Li, and R. Yang, "Apollocar3d: A large 3d car instance understanding benchmark for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5452–5462.
[59] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes challenge: A retrospective," International journal of computer vision, vol. 111, no. 1, pp. 98–136, 2015.
[60] K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: the clear mot metrics," EURASIP Journal on Image and Video Processing, vol. 2008, pp. 1–10, 2008.
[61] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia, "Incorporating second-order functional knowledge for better option pricing," Advances in neural information processing systems, vol. 13, pp. 472–478, 2000.
[62] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT'2010. Springer, 2010, pp. 177–186.
[63] Y. Nesterov, "A method of solving a convex programming problem with convergence rate o(1/k2)," in Soviet Mathematics Doklady, vol. 27, no. 2, 1983, pp. 372–376.
[64] B. T. Polyak and A. B. Juditsky, "Acceleration of stochastic approximation by averaging," SIAM journal on control and optimization, vol. 30, no. 4, pp. 838–855, 1992.
[65] D. Ruppert, "Efficient estimations from a slowly convergent robbins-monro process," Cornell University Operations Research and Industrial Engineering, Tech. Rep., 1988.
[66] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, "Rmpe: Regional multi-person pose estimation," in International Conference on Computer Vision (ICCV), 2017, pp. 2334–2343.
[67] M. Wang, J. Tighe, and D. Modolo, "Combining detection and tracking for human pose estimation in videos," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11088–11096.
[68] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The kitti dataset," International Journal of Robotics Research (IJRR), 2013.
[69] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3d object detection network for autonomous driving," in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915.
[70] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun, "3d object proposals for accurate object class detection," in Advances in Neural Information Processing Systems. Citeseer, 2015, pp. 424–432.
[71] A. Simonelli, S. R. Bulo, L. Porzi, M. López-Antequera, and P. Kontschieder, "Disentangling monocular 3d object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1991–1999.
[72] Z. Liu, Z. Wu, and R. Tóth, "Smoke: Single-stage monocular 3d object detection via keypoint estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 996–997.
[73] J. Ku, A. D. Pon, and S. L. Waslander, "Monocular 3d object detection leveraging accurate proposals and shape reconstruction," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11867–11876.
[74] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint arXiv:1904.07850, 2019.
[75] H. W. Kuhn, "The hungarian method for the assignment problem," Naval research logistics quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.

Alexandre Alahi is an Assistant Professor at EPFL. He spent five years at Stanford University as a Post-doc and Research Scientist after obtaining his Ph.D. from EPFL. His research enables machines to perceive the world and make decisions in the context of transportation problems and smart environments. He has worked on the theoretical challenges and practical applications of socially-aware Artificial Intelligence, i.e., systems equipped with perception and social intelligence. He was awarded the Swiss NSF early and advanced researcher grants for his work on predicting human social behavior. Alexandre has also co-founded multiple startups such as Visiosafe, and won several startup competitions. He was elected as one of the Top 20 Swiss Venture leaders in 2010.