
Human3.6M:
Large Scale Datasets and Predictive Methods
for 3D Human Sensing in Natural Environments
Catalin Ionescu∗†‡, Dragos Papava∗‡, Vlad Olaru∗, Cristian Sminchisescu§∗

Abstract—We introduce a new dataset, Human3.6M, of 3.6 Million accurate 3D Human poses, acquired by recording the performance
of 5 female and 6 male subjects, under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next
generation of human pose estimation models and algorithms. Besides increasing the size of the datasets in the current state of the art
by several orders of magnitude, we also aim to complement such datasets with a diverse set of motions and poses encountered as
part of typical human activities (taking photos, talking on the phone, posing, greeting, eating, etc.), with additional synchronized image,
human motion capture and time of flight (depth) data, and with accurate 3D body scans of all the subject actors involved. We also
provide controlled mixed reality evaluation scenarios where 3D human models are animated using motion capture and inserted using
correct 3D geometry, in complex real environments, viewed with moving cameras, and under occlusion. Finally, we provide a set of
large scale statistical models and detailed evaluation baselines for the dataset illustrating its diversity and the scope for improvement
by future work in the research community. Our experiments show that our best large scale model can leverage our full training set to
obtain a 20% improvement in performance compared to a training set of the scale of the largest existing public dataset for this problem.
Yet the potential for improvement by leveraging higher capacity, more complex models with our large dataset, is substantially vaster and
should stimulate future research. The dataset together with code for the associated large-scale learning models, features, visualization
tools, as well as the evaluation server, is available online at http://vision.imar.ro/human3.6m.

Index Terms—3D human pose estimation, human motion capture data, articulated body modeling, optimization, large scale learning,
structured prediction, Fourier kernel approximations.

1 INTRODUCTION

Accurately reconstructing the 3D human poses of people from real images, in a variety of indoor and outdoor scenarios, has a broad spectrum of applications in entertainment, environmental awareness, or human-computer interaction [1], [2], [3]. Over the past 15 years the field has made significant progress, fueled by new optimization and modeling methodology, discriminative methods, feature design and standardized datasets for model training. It is now widely agreed that any successful human sensing system, be it generative, discriminative or combined, would need a significant training component, together with strong constraints from image measurements, in order to be successful, particularly under monocular viewing and (self-)occlusion. Such situations are not infrequent but rather commonplace in the analysis of images acquired in real world situations. Yet these images cannot be handled well with the human models and training tools currently available in computer vision. Part of the problem is that humans are highly flexible, move in complex ways against natural backgrounds, and their clothing and muscles deform. Other confounding factors like occlusion may also require comprehensive scene modeling, beyond just the humans in the scene. Such image understanding scenarios stretch the ability of the pose sensing system to exploit prior knowledge and structural correlations, by using the incomplete visible information in order to constrain estimates of unobserved body parts. One of the key challenges for trainable systems is insufficient data coverage. Existing state of the art datasets like HumanEva [4] contain about 40,000 different poses, and the class of motions covered is somewhat small, reflecting its design purpose geared primarily towards algorithm evaluation. In contrast, while we want to continue to be able to offer difficult benchmarks, we also wish to collect datasets that can be used to build operational systems for realistic environments. People in the real world move less regularly than assumed in many existing datasets. Consider the case of a pedestrian, for instance. It is not that frequent, particularly in busy urban environments, to encounter ‘perfect’ walkers. Driven by their daily tasks, people carry bags, walk with hands in their pockets and gesticulate when talking to other people or on the phone. Since the human kinematic space is too large to be sampled regularly and densely, we chose to collect data by focusing on a set of poses which are likely to be of interest because they are common in urban and office scenes. The poses are derived from 15 chosen scenarios for which our actors were given general instructions, but were also left ample freedom to improvise.

∗ Institute of Mathematics of the Romanian Academy.
§ Department of Mathematics, Faculty of Engineering, Lund University.
† Faculty of Mathematics and Natural Sciences, University of Bonn.
‡ These authors contributed equally. Corresponding authors: V. Olaru ([email protected]), C. Sminchisescu ([email protected]).

This choice helps us cover more densely some of the common pose variations and at the same time control the difference between training and testing data (or covariate shift [5]) without placing unrealistic restrictions on their similarity. However, the variability within daily tasks like “talking on the phone” or “eating” is subtle, as functionally similar programs are being performed irrespective of the exact execution. In contrast, the distributions of any two such different scenarios are likely to contain more widely separated poses, although the manifolds from which this data is sampled may intersect.

In this paper we present a large dataset collected using accurate marker-based motion capture systems and actors dressed with moderately realistic clothing, viewed against indoor backgrounds. Other recent experimental systems have explored the possibility of unconstrained capture based on non-invasive sensors or attached body cameras [6], [7], and could represent attractive alternatives, as they develop, in the long run. As technology matures, progress on all fronts is welcome, particularly as the data we provide is complementary in its choice of poses with respect to existing datasets [8], [4], [7]. Even by means of a combined community effort, we are not likely to be able to densely sample or easily handle the 30+ dimensional space of all human poses. However, an emphasis on typical scenarios and larger datasets, in line with current efforts in visual recognition, may still offer a degree of prior knowledge bootstrapping that can significantly improve the performance of existing human sensing systems. Specifically, by design, we aim to cover the following aspects:

Large Set of Human Poses, Diverse Motion and Activity Scenarios: We collected over 3.6 million different human poses, viewed from 4 different angles, using an accurate human motion capture system. The motions were executed by 11 professional actors, and cover a diverse set of everyday scenarios including conversations, eating, greeting, talking on the phone, posing, sitting, smoking, taking photos, waiting, and walking in various non-typical scenarios (with a hand in the pocket, talking on the phone, walking a dog, or buying an item).

Synchronized Modalities, 2D and 3D Data, Subject Body Scans: We collect and fully synchronize both the 2D and the 3D data, in particular images from 4 high-speed progressive scan, high-resolution video cameras, a time-of-flight (TOF) depth sensor, as well as human motion capture data acquired by 10 high-speed cameras. We also provide 3D full body models of all subjects in the dataset, acquired with an accurate 3D laser scanner.

Evaluation Benchmarks, Complex Backgrounds, Occlusion: The dataset provides not only training, validation and testing sources for the data collected in the laboratory, but also a variety of mixed-reality settings where realistic graphical characters have been inserted in video environments collected using real, moving digital cameras, and animated using our motion capture data. The insertions and occlusions are geometrically correct, based on estimates of the camera motion and its internal parameters, reconstructions of the 3D environment, and ground plane estimates.

Online Large-Scale Models, Features, Visualization and Evaluation Tools: We provide online models for feature extraction as well as pose estimation, including linear and kernel regressors and structured predictors based on kernel dependency estimation. All these models are complemented with linear Fourier approximations, in order to allow the training of non-linear kernel models at large scale. The design of such models is currently non-trivial, and the task of processing millions of images and 3D poses, or training using such large repositories, remains daunting for most existing human pose estimation methodologies. We also supply methods for background subtraction and for extracting the bounding boxes of people, as well as a variety of precomputed features (pyramids of SIFT grids) over these, in order to allow rapid prototyping, experimentation, and parallel work streams in both computer vision and machine learning. Software for the visualization of skeleton representations based on 3D joint positions as well as 3D joint angle formats is provided, too.

1.1 Related Work

Over the past decade, inferring the 3D human pose from images or video has received significant attention in the research community. While a comprehensive survey would be impossible, we refer the reader to the recently edited volumes by Moeslund et al. [1] and Rosenhahn et al. [2], as well as [3], [9], for a comprehensive overview. Initially, work in 3D human sensing focused on 3D body modeling and relied on non-linear optimization techniques. More recently, the interest shifted somewhat towards systems where components are trained based on datasets of human motion capture. Within the realm of 3D pose inference, some methods focus on automatic discriminative prediction [10], [11], [12], [13], [14], whereas others aim at model-image alignment [15], [16], [17], [18], [19], [20], [21], [22], [23] or accurate modeling of 3D shape [24], [25] or clothing [26]. This process is ongoing and was made possible by the availability of 3D human motion capture [8], [4], as well as human body scan datasets like the commercially available CAESAR, or smaller academic repositories like SCAPE [27] and INRIA4D [28]. Training models for 3D human sensing is not straightforward, however. The CMU dataset [8] contains a diverse collection of human poses, yet these are not synchronized with the image data, making end-to-end training and performance evaluation difficult. Due to difficulties in obtaining accurate 3D pose information with synchronized image data, evaluations were initially qualitative. Quantitative evaluations were pursued later using graphic renderings of synthetic models [29], [30], [12]. The release of high-quality synchronized data in the HumanEva benchmark [4] has represented a significant step forward, but its size and pose diversity remain somewhat small.

Fig. 1. A real image showing multiple people in different poses (left), and a matching sample of our actors in similar poses (middle)
together with their reconstructed 3D poses from the dataset, displayed using a synthetic 3D model (right). The desire to cover the
diversity of 3D poses present in such real-world environments has been one of our motivations for the creation of Human3.6M.

Commercial datasets of human body scans like CAESAR are comprehensive and offer statistically significant body shape variations of an entire population, but provide no motion or corresponding image data for the subjects involved. An ambitious effort to obtain 3D pose information by manual annotation was pursued in [31], although the 2D and 3D labelings are only qualitative, and the size of the gathered data is small: 1000 people in 300 images. Approaches to 2D human body localization have also been pursued [32], [33], [34], [31]. For 2D pose estimation, ground truth data can be obtained by simply labeling human body parts in the image. Existing datasets include stickmen annotations [33], [35] and extensions of poselets with 2D annotations [36].

The advances on various fronts, 2D and 3D, both in terms of methodology and data availability, have motivated the recent interest towards realistic 3D human motion capture in natural environments [37], [6], [38], [7]. Very encouraging results have been obtained, but there are still challenges that need to be solved before the technology will enable the acquisition of millions of human poses. Recent interest in 3D motion capture technologies has been spurred by the public availability of time-of-flight, infrared or structured light sensors [39], [40]. The most well-known of these, the Kinect system, represents a vivid illustration of a successful real-time pose estimation solution deployed in a commercial setting. Its performance is in part due to a large scale training set of roughly 1 million pose samples, which remains proprietary, and in part due to the availability of depth information that simplifies the segmentation of the person from its surroundings, and limits 3D inference ambiguities for limbs. By its size and complexity, Human3.6M is meant to provide the research community with the data necessary to achieve similar performance in the arguably more difficult case of only working with intensity images, or alternatively–through our time-of-flight data–in similar setups as Kinect, by means of open access and larger and more diverse datasets.

2 DATASET COLLECTION AND DESIGN

In this section we describe the capture space and the recording conditions, as well as our dataset composition and its design considerations.

2.1 Experimental setting

Our laboratory setup, represented in figure 2(c), lets us capture data from 15 sensors (4 digital video cameras, 1 time-of-flight sensor, 10 motion cameras), using hardware and software synchronization (see figure 2(b) for details). The designated laboratory area is about 6m x 5m, and within it we obtain a region of approximately 4m x 3m of effective capture space, where subjects were fully visible in all video cameras. Digital video (DV) cameras (4 units) are placed in the corners of the effective capture space. A time-of-flight sensor (TOF) is also placed on top of one of the digital cameras. A set of 10 motion capture (MoCap) cameras are rigged on the walls to maximize the effective experimentation volume, 4 on each of the left and right edges and 2 roughly mid-way on the horizontal edges. A 3D laser body scanner from Human Solutions (Vitus LC3) was used to obtain accurate 3D volumetric models for each of the actors participating in the experiments.

The 3D motion capture system relies on small reflective markers attached to the subject’s body and tracks them over time. Tracking maintains the label identity and propagates it through time from an initial pose which is labeled either manually or automatically. A fitting process uses the position and identity of each of the body labels, as well as proprietary human motion models, to infer accurate pose parameters.

2.2 Dataset Structure

In this section we describe the choice of human motions captured in the dataset, the output data types provided, as well as the image processing and input annotations that we pre-compute.

Actors and Human Pose Set: The motions in the dataset were performed by 11 professional actors, 5 female and 6 male, chosen to span a body mass index (BMI) ranging from 17 to 29. We have reserved 7 subjects, 3 female and 4 male, for training and validation, and 4 subjects (2 female and 2 male) for testing. This choice provides a moderate amount of body shape variability as well as different ranges of mobility. Volumetric information in the form of 3D body scans was gathered for each actor to complement the joint position information alone.

(a) The number of 3D human poses in Human3.6M in training, validation and testing, aggregated over each scenario. We used 5 subjects for training (2 female and 3 male), 2 for validation (1 female and 1 male) and 4 subjects for testing (2 female and 2 male). The number of video frames is the same as the number of poses (4 cameras capturing at 50Hz). The number of TOF frames can be obtained by dividing the table entries by 8 (1 sensor capturing at 25Hz).

Type of action                 Scenario        Train       Validation  Test
Upper body movement            Directions      83,856      50,808      114,080
                               Discussion      154,392     68,640      140,764
Full body upright variations   Greeting        69,984      33,096      84,980
                               Posing          70,948      25,800      85,912
                               Purchases       49,096      33,268      48,496
                               Taking Photo    67,152      38,216      89,608
                               Waiting         98,232      54,928      123,432
Walking variations             Walking         114,468     47,540      93,320
                               Walking Dog     77,068      30,648      59,032
                               Walking Pair    76,620      36,876      52,724
Variations while seated        Eating          109,360     39,372      97,192
on a chair                     Phone Talk      132,612     39,308      92,036
                               Sitting         110,228     46,520      89,616
                               Smoking         138,028     50,776      85,520
Sitting on the floor           Sitting Down    112,172     50,384      105,396
Various movements              Miscellaneous   -           -           105,576
Total                                          1,464,216   646,180     1,467,684

(b) Technical summary of our different sensors.

MoCap system:  10 x Vicon T40, 4 Megapixel resolution, 200Hz, hardware sync
DV system:     4 x Basler piA1000, 1000x1000 resolution, 50Hz, hardware sync
TOF system:    1 x Mesa SR4000, 176x144 resolution, 25Hz, software sync
Body scanner:  Vitus Smart LC3, 3 lasers, point density 7 dots/cm3, tolerance < 1mm

(c) Floor plan showing the capture region and the placement of the video, MoCap and TOF cameras.

Fig. 2. Overview of the data and the experimental setup. (a) Number of frames in training, validation and testing by scenario. (b)
Technical specification of our sensors. (c) Schema of our capture space and camera placement.

Fig. 3. A sample of the data provided in our dataset from left to right: RGB image, person silhouette (bounding box is also
available), time-of-flight (depth) data (range image shown here), 3D pose data (shown using a synthetic graphics model), accurate
body surface obtained using a 3D laser scanner.

This data can be used also to evaluate human body shape estimation algorithms [24]. The meshes are released as part of the dataset. The subjects wore their own regular clothing, as opposed to special motion capture outfits, to maintain as much realism as possible. The actors were given detailed tasks and were shown visual examples (images of people) in order to help them plan a stable set of poses for the creation of training, validation and test sets. However, when executing these tasks, the actors were given quite a bit of freedom to move naturally instead of being forced into a strict interpretation of the motions or poses corresponding to each task.

The dataset consists of 3.6 million different human poses collected with 4 digital cameras. Data is organized into 15 training scenarios including walking with many types of asymmetries (e.g. walking with a hand in a pocket, walking with a bag on the shoulder), sitting and lying down, various types of waiting poses and so on. The structure of the dataset is shown in table 2(a).

Joint Positions and Joint Angle Skeleton Representations: Common pose parametrizations considered in the literature include relative 3D joint positions (R3DJP) and the kinematic representation (KR). Our dataset provides data in both parametrizations, with a full skeleton containing the same number of joints (32) in both cases. In the first case (R3DJP), the joint positions in a 3D coordinate system are provided. The data is obtained from the joint angles (provided by Vicon’s skeleton fitting procedure) by applying forward kinematics on the skeleton of the subject. The parametrization is called relative because there is a specially designated joint, usually called the root (roughly corresponding to the pelvis bone position), which is taken as the center of the coordinate system, while the other joints are estimated relative to it. The kinematic representation (KR) considers the relative joint angles between limbs and is more convenient because it is invariant to both scale and body proportions. The dependencies between variables are, however, much more complex, making estimation more difficult. The process of estimating the joint angles involves non-linear optimization under joint limit constraints.
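To make the relation between the two parametrizations concrete, below is a minimal forward kinematics sketch that turns a kinematic (joint angle) representation into relative 3D joint positions. The ZXY Euler convention follows the dataset description, but the exact composition order, skeleton layout and rest-pose offsets used here are illustrative assumptions rather than the Vicon conventions shipped with the release.

```python
import numpy as np

def rot_zxy(rz, rx, ry):
    """Rotation matrix from ZXY Euler angles in radians. The dataset exports
    Euler ZXY joint angles; the composition order used here is one plausible
    convention and may need adapting to the released skeleton files."""
    cz, sz = np.cos(rz), np.sin(rz)
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    return Rz @ Rx @ Ry

def forward_kinematics(parents, offsets, angles):
    """Relative 3D joint positions (R3DJP) from a kinematic representation (KR).

    parents[i] : index of the parent joint, -1 for the root; joints must be
                 ordered so that parents precede their children
    offsets[i] : (3,) limb vector from the parent to joint i in the rest pose
    angles[i]  : (rz, rx, ry) rotation of joint i relative to its parent
    The root stays at the origin, matching the 'relative' parametrization in
    which all other joints are expressed with respect to the root."""
    n = len(parents)
    pos = np.zeros((n, 3))
    rots = [np.eye(3)] * n
    for i in range(n):
        R = rot_zxy(*angles[i])
        if parents[i] < 0:
            rots[i] = R                      # global orientation of the root
        else:
            p = parents[i]
            rots[i] = rots[p] @ R            # accumulate rotations down the chain
            pos[i] = pos[p] + rots[p] @ offsets[i]
    return pos
```

Note that scale and body proportions enter only through the offsets, which is why the joint angle representation itself is invariant to them, at the cost of more complex dependencies between variables.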

Fig. 4. Sample images from our dataset, showing the variability of subjects, poses and viewing angles.

Fig. 5. High resolution meshes (body scans) of the actors involved in our experiments, illustrating the body shape variations in the
dataset.

We devoted significant efforts to ensure that the data is clean and the fitting process accurate, by also monitoring the image projection errors of body joint positions. These positions were obtained based on forward kinematics, after fitting, and compared against image marker tracks. Outputs were visually inspected multiple times, during different processing phases, to ensure accuracy. These representations can be directly used in independent monocular predictions or in a multi-camera estimation setting. The monocular prediction dataset can be increased 4-fold by globally rotating and translating the pose coordinates to map the 4 DV cameras into a unique coordinate system (we also provide code for this data manipulation). As seen in table 2(b), poses from motion capture are also available at (4-fold) faster rates compared to the images from DV cameras. Our code also provides the option to double both the image and the 3D pose data by generating their mirror symmetries. This procedure can yield 7 million images with corresponding 3D poses.

Image Processing, Silhouettes and Person Bounding Boxes: Pixel-wise, figure-ground segmentations for all images were obtained using background models. We trained image models as mixtures of Gaussian distributions in each of the RGB and HSV color channels, as well as the gradient in each RGB channel¹ (a total of 3+3+2x3=12 channels). We used the background models in a graph cut framework to obtain the final figure-ground pixel labeling. The weights of the input features for the graph cut model were learned by optimizing a measure of pixel segmentation accuracy on a set of manually labeled ground truth silhouettes for a subset of images sampled from different videos. The segmentation measure we used was the standard overlap, expressed as pixel-wise intersection over union between the hypothesis and the ground-truth silhouette. A Nelder-Mead optimization algorithm was used in order to handle non-smooth objectives.

1. Note that gradients are stable because the cameras are fixed.
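As a rough, self-contained stand-in for the background models described above, the sketch below fits one Gaussian per pixel and channel to background-only frames and thresholds a squared Mahalanobis score. The released pipeline is richer (per-pixel mixtures over 12 RGB/HSV/gradient channels, a graph cut labeling, and channel weights learned with Nelder-Mead), none of which is reproduced here.

```python
import numpy as np

def fit_background(frames):
    """Fit an independent Gaussian per pixel and channel.
    frames: (T, H, W, C) stack of background-only frames; the actual models
    use per-pixel mixtures over 12 channels rather than a single Gaussian."""
    mean = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6   # floor the variance for stability
    return mean, var

def foreground_mask(image, mean, var, thresh=16.0):
    """Mark pixels that deviate strongly from the background model.
    `thresh` is a hand-picked constant for this sketch; the paper instead
    feeds such evidence to a graph cut whose weights are learned against
    manually labeled silhouettes."""
    score = ((image - mean) ** 2 / var).sum(axis=-1)
    return score > thresh
```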

The dataset also provides accurate bounding box annotations for people. This data was obtained by projecting the skeleton in the image and fitting a rectangular box around the projection. For accurate estimates, a separate camera calibration procedure was performed to improve the accuracy of the default one provided by the Vicon system. This extra calibration is necessary because the camera distortion parameters are not estimated by the default calibration method. The calibration data is also provided with the release of the dataset. It was obtained by positioning 30 reflective markers on the capture surface and by manually labeling them in each of the cameras with subpixel accuracy. Models with second order radial distortion parameters were fitted to this data, separately for each of the four DV cameras. This procedure resulted in significantly improved calibration parameters, with a 0.17 pixel mean re-projection error.

Additional Mixed Reality Test Data: Besides creating laboratory test sets, we also focused on providing test data that covers variations in clothing and complex backgrounds, as well as camera motion and occlusion (fig. 6). We created the mixed reality videos by inserting high quality 3D rigged animation models in real videos with realistic and complex backgrounds, good quality image data and accurate 3D pose information. The movies were created by inserting and rendering 3D models of a fully clothed synthetic character (male or female) in real videos. We are not aware of any setting of this level of difficulty in the literature. Real images may show people in complex poses, but the diverse backgrounds as well as the scene illumination and the occlusions can vary independently and represent important nuisance factors that vision systems should be robust against. Although approaches to offset such nuisance factors exist in the literature, it is difficult to evaluate their effectiveness because ground truth pose information for real images is hard to obtain. Our dataset features a component that has been especially designed to address such hard cases. This is not the only possible realistic testing scenario – other datasets in the literature [7], [6] also contain realistic testing scenarios such as different sport motions or backgrounds. Prior efforts to create mixed reality setups for training and testing 3D human pose estimation methods exist, including our own prior work [41], [42] but also [43] and more recently [44]. However, none of the prior work datasets were sufficiently large. Perhaps more importantly, the insertion of the 3D synthetic human character did not take into account the geometry of the camera that captured the background and that of the 3D scene (e.g. ground plane, occluders), as we do in this work.²

The poses used for animating the models were selected directly from our laboratory test set. The Euler ZXY joint angles extracted by the motion capture system were used to create files where limb lengths were matched automatically to the models. The limb lengths were necessary in the next step, where we retargeted the captured motion data to the skeletons of the graphics models, using animation software. The actual insertion required solving for the (rigid) camera motion, as well as for its internal parameters [45], for good quality rendering. The exported camera tracks as well as the model were then imported into animation software, where the actual rendering was performed. The scene was set up and rendered using the mental ray ray-tracing renderer, with several well-placed area lights and skylights. To improve quality, we placed a transparent plane on the ground, to receive shadows. Scenes with occlusion were also created. The dataset contains 5 different dynamic backgrounds obtained with a moving camera, a total of 7,466 frames, out of which 1,270 frames contain various degrees of occlusion. A sample of the images created is shown in fig. 6. We see this component of Human3.6M as a taster. Given the large volume of motion capture data we collected, we can easily generate large volumes of mixed reality video with people having different body proportions and different clothing, against different real static or moving backgrounds, for both training and testing.

2. The insertion process involved the composition of the character silhouette sprite with the image background, with all the 3D geometric inconsistency and the image processing artifacts this can lead to.

3 LARGE SCALE POSE ESTIMATION MODELS

We provide several large scale evaluation models with our dataset and we focus on automatic discriminative frameworks due to their conceptual simplicity and potential for scalability. The estimation problem is framed as learning a mapping (or an index) from image descriptors extracted over the person silhouette or its bounding box, to the pose represented based on either joint positions or joint angles. Let Xi be the image descriptor for frame i, Yi the pose representation for frame i, and f (or fW) the mapping with parameters W. Our goal is to estimate a model with fW(X) ≃ Y, for X and Y not seen in training. Specifically, the methods we considered are: k-nearest neighbor (kNN), linear and kernel ridge regression (LinKRR, KRR), as well as structured prediction methods based on kernel dependency estimation (KDE) [46], [12], where, for scalability reasons, we used Fourier kernel approximations [47], [48], [38]. Training such models (or any other human pose prediction method, for that matter) using millions of examples is highly non-trivial and has not been demonstrated so far in the context of such a continuous prediction problem, with structured, highly correlated outputs.

Fig. 6. Sample images from our mixed reality test set. The data is challenging due to the complexity of the backgrounds, viewpoints,
diverse subject poses, camera motion and occlusion.

k-Nearest neighbor regression (kNN) is one of the simplest methods for learning f [49]. ‘Training’ implies storing all examples or, in our case, a subset of them, due to running time constraints. Depending on the distance function used, an intermediate data structure, typically KD or cover trees [50], can be constructed during training in order to speed up inference at test time. These data structures, however, are dependent on the input metric and pay off mostly for problems with low input dimensionality, which is not our case. As inputs we use the χ2 comparison metric for histograms, which is known to perform well on gradient distributions (we use pyramids of SIFT grids extracted over the person silhouette or bounding box). For vectors X = [x1 . . . xd] and Y = [y1 . . . yd], the χ2 distance is defined as

$$\chi^2(\mathbf{X}, \mathbf{Y}) = \sqrt{\frac{1}{d} \sum_{l} \frac{(x_l - y_l)^2}{x_l + y_l}} \qquad (1)$$

In order to be able to run experiments within a reasonable amount of time for certain non-approximated models, we had to work with only 400K training examples, and subsample the data whenever this upper bound was exceeded. In the experiments we used k = 1. In this case the prediction is made by returning the stored target corresponding to the closest example from the training set under the input metric.

Kernel ridge regression (KRR) is a simple and reliable kernel method [51] that can be applied to predict each pose dimension (joint angles or joint positions) independently, with separately trained models. Parameters αi for each model are obtained by solving a non-linear l2-regularized least-squares problem:

$$\arg\min_{\alpha} \frac{1}{2} \sum_{j} \Big\| \sum_{i} \alpha_i k(\mathbf{X}_j, \mathbf{X}_i) - \mathbf{Y}_j \Big\|_2^2 + \lambda \|\alpha\|_2^2 \qquad (2)$$

The problem has a closed form solution

$$\alpha = (K + \lambda I)^{-1} \mathbf{Y} \qquad (3)$$

with Kij = k(Xi, Xj) and Y = [y1, . . . , yn]. The weakness of the method is the cubic scaling in the training set size, because of an n × n matrix inversion. In our experiments, we choose χ2 as our input metric and the exponential map to transform the metric into a kernel, i.e. k(Xi, Xj) = exp(−β χ2(Xi, Xj)), where β is a scale parameter. This kernel is called the exponential-χ2 kernel in the literature. Prediction is done using the rule fα,β(X) = Σi αi k(X, Xi).

Fourier Embeddings for Kernel Approximation. Our large scale non-linear prediction approach relies on methods to embed the data into a Euclidean space using an approximate mapping derived from the Fourier transform of the kernel. The procedure [52], [48] relies on a theorem, due to Bochner, that guarantees the existence of such a mapping for the class of translation invariant kernels. This class contains the well-known and widely used Gaussian and Laplace kernels. The idea is to approximate a potentially infinite-dimensional or analytically unavailable kernel lifting with a finite embedding that can be computed explicitly. The approximation can be derived as an expectation, in the frequency domain, of a feature function φ which depends on the input. The expectation is computed using a density µ over frequencies, which is precisely the Fourier transform of the kernel k:

$$k(\mathbf{X}_i, \mathbf{X}_j) \simeq \int_{\omega} \phi(\mathbf{X}_i; \omega)\, \phi(\mathbf{X}_j; \omega)\, \mu(\omega) \qquad (4)$$

The existence of the measure is a key property because it allows an approximation of the integral with a Monte Carlo estimate, based on a finite sample from µ. We therefore obtain not only an explicit representation of the kernel – which is separable in the inputs, i.e. k(Xi, Xj) ≃ Φ(Xi)Φ(Xj)⊤, with Φ(Xi) = [φ(Xi; ω1) . . . φ(Xi; ωD)] a vector of the φ(Xi; ω) and ω1, . . . , ωD being D samples from µ(ω) – but at the same time we benefit from a kernel approximation guarantee which is independent of the learning cost. The explicit Fourier feature map can then be used in conjunction with linear methods for prediction.
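The pieces above translate almost line by line into code. The sketch below implements the χ2 distance of equation (1), the exponential-χ2 kernel, the exact KRR solution of equation (3), and a random Fourier feature map. For the feature map we show the classical construction for the Gaussian kernel, which Bochner's theorem covers directly; the exponential-χ2 approximation actually used with the dataset follows [48] and is more involved. The primal ridge solve corresponding to equation (6) below is included to show how the explicit map is consumed.

```python
import numpy as np

def chi2_dist(X, Y):
    """Pairwise chi-squared distance of eq. (1) between rows of X and Y."""
    d = X.shape[1]
    num = (X[:, None, :] - Y[None, :, :]) ** 2
    den = X[:, None, :] + Y[None, :, :] + 1e-12   # guard empty histogram bins
    return np.sqrt((num / den).sum(axis=-1) / d)

def exp_chi2_kernel(X, Y, beta):
    """Exponential-chi2 kernel k(x, y) = exp(-beta * chi2(x, y))."""
    return np.exp(-beta * chi2_dist(X, Y))

def krr_fit(K, Y, lam):
    """Exact dual KRR solution alpha = (K + lam*I)^{-1} Y of eq. (3);
    cubic in the number of examples, hence usable only on subsets."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), Y)

def krr_predict(K_test_train, alpha):
    """Prediction rule f(X) = sum_i alpha_i k(X, X_i)."""
    return K_test_train @ alpha

def gaussian_fourier_map(X, W, b):
    """Random Fourier features with Phi(X) Phi(X')^T ~ k(X, X') for the
    Gaussian kernel: draw W ~ N(0, I/sigma^2) with D columns, b ~ U[0, 2*pi)."""
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def linkrr_fit(Phi, Y, lam):
    """Primal ridge solution W = (Phi^T Phi + lam*I_D)^{-1} Phi^T Y; the
    D x D system is independent of the number of training examples."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ Y)
```

For the full training set, Φ(X)⊤Φ(X) and Φ(X)⊤Y can be accumulated over mini-batches, which is what makes the linear approximation practical at the scale of millions of frames.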

Linear approximations for kernel ridge regression (LinKRR) can be used to overcome the cubic computational burden of KRR while maintaining most of its non-linear predictive performance. Using the Fourier representation and standard duality arguments, one can show that equation (2) is equivalent to

$$\arg\min_{\mathbf{W}} \frac{1}{2} \sum_{i} \|\Phi(\mathbf{X}_i)\mathbf{W} - \mathbf{Y}_i\|_2^2 + \lambda \|\mathbf{W}\|_2^2 \qquad (5)$$

This is a least squares regression model applied to non-linearly mapped data, which has the closed form solution

$$\mathbf{W} = (\Phi(\mathbf{X})^\top \Phi(\mathbf{X}) + \lambda I_D)^{-1} \Phi(\mathbf{X})^\top \mathbf{Y} \qquad (6)$$

A matrix inversion needs to be performed in this case as well, but this time the dimension of the matrix is D × D, typically much smaller than the training set size (D ≪ n). The inversion is independent of the number of examples, which makes LinKRR an attractive model for large scale training. The construction of the matrix Φ(X)⊤Φ(X) is linear in the training set size n and can be computed online with little memory consumption. Note that D is a parameter of the method and allows a trade-off between efficiency (larger D makes the inversion more demanding) and performance (larger D makes the approximation more accurate). The experimental results show that often, when D is large enough, there is little or no performance loss for many interesting kernels. To make this equivalent to the exact KRR method, we use the exponential-χ2 kernel approximation proposed by [48].

Kernel Dependency Estimation (KDE). We also considered large-scale structured prediction models first studied in a different pose estimation context by Ionescu et al. [38]. The models leverage the Fourier approximation methodology for kernel dependency estimation (KDE) [46]. Standard multiple output regression models treat each dimension independently, thus ignoring correlations between targets. In many cases this simple approach works well, but for 3D human pose estimation there are strong correlations between the positions of the skeleton joints due to the physical and anatomical constraints of the human body, the environment where humans operate, or the structure and synchrony of many human actions and activities. One possibility to model such dependencies is to first decorrelate the multivariate output through an orthogonal decomposition [53] based on Kernel Principal Component Analysis (KPCA) [54]. KPCA is a general framework covering both parametric kernels and data-driven kernels corresponding to non-linear manifold models or semi-supervised learning. The space recovered via kernel PCA gives an intermediate, low dimensional, decoupled representation of the outputs, and standard KRR can now be used to regress on each dimension independently. To obtain the final prediction, one needs to map from the orthogonal space obtained using kernel PCA, where the independent KRR predictions are made, back to the (correlated) pose space where the original output resides. This operation requires solving the pre-image problem [55]

$$\arg\min_{\mathbf{Y}} \|\Phi(\mathbf{X})\mathbf{W} - \Phi_{PCA}(\mathbf{Y})\|_2^2 \qquad (7)$$

For certain classes of kernels, pre-images can be computed analytically, but for most kernels exact pre-image maps are not available. The general approach is to optimize (7) for the point Y in the target space whose KPCA projection is closest to the prediction given by the input regressor. This is a non-linear, non-convex optimization problem, but it can be solved quite reliably using gradient descent, starting from an initialization obtained from independent predictors on the original outputs. This process can be viewed as inducing correlations by starting from an independent solution.

In this case, we apply the Fourier kernel approximation methodology to both covariates and targets, in order to obtain a very efficient structured prediction method. Once the Fourier features of the targets are computed, only their dimensionality influences the complexity needed to solve for kernel PCA, and training becomes equivalent to solving a ridge regression problem on these outputs. The resulting method is very efficient, and does not require sub-sampling the data.

4 EVALUATION AND ERROR MEASURES

We propose several different measures to evaluate performance. Each has advantages and disadvantages, so we evaluate and provide support for all of them in order to give a more comprehensive picture of the strengths and weaknesses of different methods.³ Let $m^{(f)}_{\mathbf{f},S}(i)$ be a function that returns the coordinates of the i-th joint of skeleton S, at frame f, from the pose estimator f. Let also $m^{(f)}_{gt,S}(i)$ be the i-th joint of the ground truth at frame f. Let S be the subject specific skeleton and Su be the universal skeleton. The subject specific skeleton is the one whose limb lengths correspond to the subject performing the motion. The universal skeleton has one set of limb lengths, independent of the subject who performed the motion. This allows us to obtain data in an R3DJP parametrization which is invariant to the size of the subject.

MPJPE. Much of the literature reports the mean per joint position error. For a frame f and a skeleton S, MPJPE is computed as

$$E_{MPJPE}(f, S) = \frac{1}{N_S} \sum_{i=1}^{N_S} \big\| m^{(f)}_{\mathbf{f},S}(i) - m^{(f)}_{gt,S}(i) \big\|_2 \qquad (8)$$

where NS is the number of joints in skeleton S. For a set of frames, the error is the average over the MPJPEs of all frames. Depending on the evaluation setup, the joint coordinates will be in 3D, with measurements reported in millimeters (mm), or in 2D, where the error will be reported in pixels. For systems that estimate joint angles, we offer the option to automatically convert the angles into positions and compute MPJPE, using direct kinematics on the skeleton of the test subject (the ground truth limb lengths will not be used within the error calculation protocol).

3. As the field evolves towards agreement on other metrics, not present in our dataset distribution, we plan to implement and provide evaluation support for them as well.

One of the problems with this error measure is its subject specificity. Since many methods may encounter difficulties in predicting the parameters of the skeleton (e.g. limb lengths), we propose a universal MPJPE measure which considers the same limb lengths for all subjects, by means of a normalization process. We denote this error measure UMPJPE.

MPJAE. A different approach to comparing poses is to use the angles between the joints of the skeleton. We offer this possibility for methods that predict the pose in a joint angle parametrization. We call this error the mean per joint angle error (MPJAE). The angles are computed in 3D:

$$E_{MPJAE}(f, S) = \frac{1}{3N_S} \sum_{i=1}^{3N_S} \Big| \big( m^{(f)}_{\mathbf{f},S}(i) - m^{(f)}_{gt,S}(i) \big) \bmod \pm 180 \Big| \qquad (9)$$

In this case, the function m returns the joint angles instead of the joint positions. This error is relatively unintuitive since, perceptually, not all errors should count equally. If one makes a 30° error in predicting the elbow, only one joint, the wrist, is wrongly predicted, but a 30° error in the global rotation will misalign all joints.

MPJLE. The two previously proposed error measures, MPJPE and MPJAE, have two disadvantages. One issue is that they are not robust – one badly predicted joint can have unbounded impact on the error over the entire dataset. Secondly, errors that are difficult to perceive by humans can be overemphasized in the final result. To address some of these observations, we propose a new error measure, the mean per joint localization error, that uses a perceptual tolerance parameter t:

$$E_{MPJLE@t}(f, S) = \frac{1}{N_S} \sum_{i=1}^{N_S} \mathbf{1}_{\| m^{(f)}_{\mathbf{f},S}(i) - m^{(f)}_{gt,S}(i) \|_2 \geq t} \qquad (10)$$

This error measure can be used by fixing the tolerance level using a perceptual threshold. For instance, errors below a couple of centimeters are often perceptually indistinguishable. Alternatively, errors corresponding to different tolerance levels can be plotted together. By integrating t over an interval, say [0, 200], we can obtain an estimate of the average error in the same way mean average precision gives an estimate of the performance of a classifier. This error can also be used to evaluate a pose estimator that may not predict all joints, and such an estimator will be penalized only moderately. A related approach based on PCP curves has been pursued, for 2D pose estimation, in [35]. Here we differ in that we work in 3D as opposed to 2D, and we consider the joints independently as opposed to pairwise. More complex perceptual error measures beyond the ones we explore here can be envisaged. They could encode the contact between the person and its environment, including interaction with objects or contact with the ground plane. Our dataset does not contain people-object interactions, but the ground plane can be easily recovered. Alternatively, in order to better understand what represents a good perceptual threshold, one can explore the degree to which people can re-enact (reproduce) a variety of human poses shown to them in different images. This methodology is pursued in our recent work [56], and we refer the interested reader to it for details.
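The three measures are direct to implement. A minimal sketch, assuming predictions and ground truth arrive as (NS, 3) arrays of joint positions in mm (for MPJPE and MPJLE) or as flat vectors of 3·NS joint angles in degrees (for MPJAE):

```python
import numpy as np

def mpjpe(pred, gt):
    """Eq. (8): mean per joint position error, in the units of the input
    (mm for 3D joint positions, pixels for 2D). pred, gt: (N_S, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def mpjae(pred_deg, gt_deg):
    """Eq. (9): mean per joint angle error, with differences wrapped to
    (-180, 180] degrees. pred_deg, gt_deg: flat arrays of 3*N_S angles."""
    diff = (pred_deg - gt_deg + 180.0) % 360.0 - 180.0
    return np.abs(diff).mean()

def mpjle(pred, gt, t):
    """Eq. (10): fraction of joints whose position error is at least the
    perceptual tolerance t; sweeping t (e.g. over [0, 200] mm) yields the
    curves plotted for this measure."""
    return (np.linalg.norm(pred - gt, axis=-1) >= t).mean()
```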

Fig. 7. Examples of visual ambiguities naturally occurring in our dataset, and automatically identified. The first two examples show
foreshortening events for the left and right arm respectively. The next four images show a “Type 1” ambiguity for the right and left
arm. The last two images show one “Type 2” ambiguity event: first the ambiguous leg pose and then the pose with respect to which
the ambiguity was detected.

Fig. 8. Illustration of additional annotations in our dataset. The first 4 images show 2 pairs of consistent poses found from different
subjects. The last 3 images are obtained using 3D pose data. We render a simple geometric model and project it in the image to
obtain additional annotation: joint visibility, body part support in the image and even rough pixel-wise 3D information. This allows
us to detect joint occlusion, partial limb occlusion and obtain depth discontinuities.

5 EXPERIMENTAL ANALYSIS

5.1 Data Analysis

This dataset places us in the unique position of having both large amounts of data gathered from relatively unconstrained actors, with regard to stage direction, and high accuracy 3D ground-truth information. In this section we use this ground-truth information to try to gain insight into the diversity and repeatability of the poses contained in the dataset we have captured. We also take advantage of the data annotation in order to easily assess the occurrence of certain visual phenomena such as foreshortening, ambiguities and self-occlusions.

Diversity. An easy way to assess the diversity of our data is to check how many distinct poses have been obtained. We consider two poses to be distinct if at least one joint differs from the corresponding joint of the other pose beyond a certain tolerance t, i.e. maxi ||m1(i) − m2(i)||2 > t. Since our goal is to provide not only pose but also appearance variations, poses of different subjects are considered different, independently of how similar they are in 3D. This experiment reveals that for a 100mm tolerance, 12% of the frames are distinct, for a total of about 438,654 images. These figures grow to 24%, or 886,409 images, when the tolerance is lowered to 50mm.

Repeatability. Pose estimation from images is a difficult problem because appearance varies not only with pose, but also with a number of “nuisance” factors like body shape and clothing. One way to deal with this problem is to isolate pose variation from all the other factors by generating a dataset of pairs of highly similar poses originating from different subjects (see figure 8 for examples, as well as early work on learning distance functions that preserve different levels of invariance in a hierarchical framework [42]). We compare poses using the distance between the most distant joints, with a threshold at 100mm. Note that whenever a pair of similar poses is detected, temporally adjacent frames are also very similar. We eliminate these redundant pairs by clustering them in time and picking only the most similar as the representative pair. In the end, we obtain a dataset with 10,926 pairs, half of which come from the “Discussion”, “Eating” and “Walking” scenarios.

Foreshortening. To assess the occurrence of foreshortening we consider the projections of the 3 joints of one limb (shoulder, elbow and wrist for the arms; hip, knee and ankle for the legs). We claim that such an event has happened when all these projections are very close to each other and the depth ordering of the joints is correct. In our experiments we calibrate a 20 pixel tolerance in the center of the capture surface and normalize it using the distance to the camera. We observe that although these events are rare in our dataset, they do happen. After clustering and removing redundant events, we counted 138 foreshortening events for arms and 12 for legs in the training data, and 82 and 2, respectively, for the test data. Since most of our scenarios contain mainly standing poses, foreshortening happens mostly for arms, although for the “Sitting” and “Sitting Down” scenarios it occurs for legs as well (14 occurrences in total). As one might expect, the “Directions” scenario, where the subject points in different directions, has the most foreshortening occurrences (60 in total), while some scenarios like “Smoking” had none.

Ambiguities. When predicting 3D human pose from static images we are inverting an inherently lossy non-linear transformation that combines perspective projection and kinematics [18], [19]. This ambiguity makes it difficult, in the absence of priors other than the joint angle limits or the body non-self-intersection constraints, to recover the original 3D pose from its projection, and the ambiguities may persist temporally [57]. The existence of monocular 3D ambiguities is well known [18], [57], but it is interesting to study to what extent these are present among the poses of a large, ecological dataset. We can assess the occurrence of ambiguities by looking at 3D and 2D ground truth pose information. We separate ambiguity events into two types. “Type 1” (T1) is an ambiguity that occurs at the level of one limb. We consider two poses of a limb ambiguous if their projections are closer than a threshold d2D while the MPJPE is larger than some distance d3D. In our experiment we use d2D = 5 pixels and d3D = 100 mm. These thresholds provide a large number of pairs of frames, many of which are consecutive. For a result that is easier to interpret, we group the pairs using their temporal indices and keep only one example per group. The second type of ambiguity occurs between two limbs of the same type, i.e. arms or legs. If, for two different poses, the projections of the joints of one limb are close to the projections of those of another limb corresponding to the second pose, while the poses are still relatively consistent, i.e. the MPJPE is not too large, then we consider this to be a “Type 2” (T2) ambiguity. The constraint on MPJPE is added to remove forward-backward flips, which are the most likely cause of similar projections for different limbs. Examples of both types of ambiguities are given in figure 7. Notice however that the presence of a subset of ambiguities in our captured data does not imply that such ambiguities and their number would immediately correlate to the ones obtained by an automatic monocular pose estimation system–we know that a larger set of geometric ambiguities exists, ex-ante. The question is to what extent the pose estimation system can be made to ‘not see them’ using image constraints, prior knowledge, or information from the environment and the task. The results are summarized in table 1.
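The T1 test above reduces to a few lines once 2D projections and 3D poses are in hand. The sketch below uses the thresholds from the text (d2D = 5 pixels, d3D = 100 mm) and a simple gap-based grouping of temporally adjacent hits; the limb joint indices are placeholders for the dataset's actual skeleton indexing.

```python
import numpy as np

def t1_ambiguous(proj_a, proj_b, pose_a, pose_b, limb, d2d=5.0, d3d=100.0):
    """Type 1 ambiguity for one limb between two frames: the limb's 2D
    projections nearly coincide while its 3D joint error is large.
    proj_*: (J, 2) pixel projections; pose_*: (J, 3) positions in mm;
    limb: list of joint indices, e.g. [shoulder, elbow, wrist] for an arm."""
    idx = np.asarray(limb)
    close_2d = np.linalg.norm(proj_a[idx] - proj_b[idx], axis=-1).max() < d2d
    far_3d = np.linalg.norm(pose_a[idx] - pose_b[idx], axis=-1).mean() > d3d
    return close_2d and far_3d

def group_events(frame_ids, gap=1):
    """Collapse runs of temporally adjacent detections into single events,
    mirroring the grouping by temporal indices described in the text."""
    events, run = [], []
    for f in sorted(frame_ids):
        if run and f - run[-1] > gap:
            events.append(run)
            run = []
        run.append(f)
    if run:
        events.append(run)
    return events
```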

Fig. 9. The evolution of the test error as a function of the training set size. Larger training sets offer important performance benefits for all models we have tried.

TABLE 1
A summary of the results for our type 1 (T1) and type 2 (T2) ambiguity experiments, showing counts of distinct ambiguity events by scenario in our dataset.

Scenario        T1LArm  T1RArm  T1LLeg  T1RLeg  T2Legs
Directions      1235    2512    3022    3219    425
Discussion      7501    9503    8226    6167    881
Eating          2451    3130    3277    3100    175
Greeting        1507    2099    2392    2066    228
Phone Talk      2255    3154    3316    3045    191
Posing          1767    2468    2431    2145    117
Buying          922     1311    1205    962     96
Sitting         2398    3220    3508    3693    4
Sitting Down    2200    2996    3270    3407    66
Smoking         2109    3574    3660    3320    232
Taking Photo    1096    1407    1831    1611    109
Waiting         2893    3820    4387    3353    265
Walking         2407    3017    3266    2225    965
Walking Dog     1142    1395    1592    1468    298
Walking Pair    925     1406    1828    1778    366

Self-occlusion. Unlike 2D pose estimation datasets, our data does not directly provide information about joint and limb visibility. This can be computed using the available data by considering the 3D pose information and using it to render a simple geometric model which can then be projected onto the image. In figure 8, we show examples of the joint locations from which body part visibility can be easily obtained. Moreover, we can obtain dense part labels from which part visibility can be derived, and depth information which can be used to label depth discontinuity edges. All these detailed annotations are very difficult to obtain in general, and are highly relevant in the context of human pose estimation.

5.2 Prediction Experiments

In this section we provide quantitative results for several methods including nearest neighbors, regression and large-scale structured predictors. Additionally, we evaluate subject and activity specific models, as well as general models trained on the entire dataset. We also study the degree of success of the methodology in more challenging situations, like the ones available in our mixed reality dataset.

Image Descriptors. For silhouette and person bounding box description, we use a pyramid of grid SIFT descriptors with 3 levels (2x2, 4x4 and 8x8) and 9 orientation bins. Variants of these features have been shown to work well on previous datasets (e.g. HumanEva [4]) [41], [14], and we show that they are quite effective even in this more complex setting. Since both background subtraction (BS) and bounding box (BB) localization of our subjects are provided, we performed the experiments both using features extracted over the entire bounding box and using descriptors where the BS mask is used, in order to filter out some of the background.

Pose Data. Our 3D pose data is mapped to the coordinate system of one of the 4 cameras, and all predictions are performed in that coordinate system. When we report joint position errors, the root joint of the skeleton is always in the center of the coordinate system used for prediction. Errors are reported mostly in mm using MPJPE. Sometimes we use MPJAE, which reports errors in angle degrees. Human poses are represented using a skeleton with 17 joints. This limitation of the number of joints helps discard the smallest links associated with label details for the hands and feet, going down the kinematic chain only as far as the wrist and ankle joints.
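For reference, the pooling structure of the pyramid descriptor from the Image Descriptors paragraph above can be sketched as follows, assuming a precomputed (H, W, B) map of per-pixel orientation histograms (B = 9 in the released features). The dense SIFT extraction itself, and whatever normalization the released features apply, are not reproduced here.

```python
import numpy as np

def pyramid_descriptor(hist, levels=(2, 4, 8)):
    """Concatenate per-cell orientation histograms on 2x2, 4x4 and 8x8
    grids over the person bounding box (or BS-masked region), as in the
    3-level SIFT-grid pyramids. With B = 9 bins this gives
    (2*2 + 4*4 + 8*8) * 9 = 756 dimensions."""
    H, W, B = hist.shape
    feats = []
    for g in levels:
        ys = np.linspace(0, H, g + 1, dtype=int)
        xs = np.linspace(0, W, g + 1, dtype=int)
        for i in range(g):
            for j in range(g):
                cell = hist[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                feats.append(cell.sum(axis=(0, 1)))
    return np.concatenate(feats)
```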

1 1
maps (corresponding to exponentiated χ2 ), and 4000- KNN(132.18) KNN(182.59)

Average Localization Error

Average Localization Error


KRR(118.60) KRR(151.73)
0.8
d output kernel embedding (corresponding to Gaussian LinKRR(123.89) 0.8 LinKRR(162.51)
LinKDE (93.01)
kernels). Typical running times on an 8 core PC with the 0.6 LinKDE (137.98)
0.6
full dataset include 16 hours for testing kNN models 0.4

(with a training set subsampled to 300K examples), 1h 0.2


0.4

for training and 12h for testing KRR (40K example train-
0 0.2
ing set). For the full training set of 2.1 million examples 0 50 100
Tolerance
150 200 0 50 100
Tolerance
150 200

where only linear approximations to non-linear models


can be effectively applied, training LinKRR takes 5h and Fig. 10. The proposed perceptual MPJLE error measure. In
testing takes about 2h. LinKDE takes about 5h to train the left plot we show the MPJLE for the ‘Eating’ ASM. We
and 40h to test. Code for all of the methods is provided compare all our predictors, where features are computed on
on our website as well. input segments given by background subtraction. The right
Several training and testing model scenarios were prepared. The simplest one considers data from each subject separately (we call this the Subject Specific Model, or SSM). The motions for our 15 scenarios are each captured in 2 trials, which are used for training and validation, respectively. A set of 2 motions from each subject was reserved for testing (these were data captured in distinct motion performances by the subjects, not subsampled from single training sequences). The test setup includes the different types of poses that appear in the 15 training scenarios (one of the two test motions involves sitting, the second one does not). This type of experiment was designed to isolate pose variability from body shape and clothing variability. A second, more challenging scenario considers prediction with a model trained on a set of 7 fixed training subjects (5 for training and 2 for validation) and tested on the remaining 4 subjects, on a per-motion basis (we call this the Activity Specific Model, or ASM). Finally, we used a setup where all motions are considered together, using the same split among subjects (our General Model, GM).
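The three protocols differ only in how data is pooled across subjects and motions. A minimal summary sketch follows; the subject IDs are taken from tables 2 and 3 (the 7 training/validation subjects are those listed in table 2, S10 is explicitly a withheld-image test subject, and the remaining test IDs are inferred), while the exact 5/2 train/validation assignment is an assumption.

```python
# Evaluation protocol sketch (IDs per tables 2-3; the 5/2 split among the
# seven training/validation subjects is an assumption).
TRAIN_VAL_SUBJECTS = ["S1", "S5", "S6", "S7", "S8", "S9", "S11"]
TEST_SUBJECTS = ["S2", "S3", "S4", "S10"]  # S10: images withheld

PROTOCOLS = {
    # SSM: per-subject models; held-out motions of the same subject for testing
    "SSM": {"train": "one subject, trial 1", "validate": "same subject, trial 2",
            "test": "2 held-out motions of the same subject"},
    # ASM: one model per motion class, pooled over the training subjects
    "ASM": {"train": "one motion, TRAIN_VAL_SUBJECTS",
            "test": "same motion, TEST_SUBJECTS"},
    # GM: a single model over all motions and all training subjects
    "GM": {"train": "all motions, TRAIN_VAL_SUBJECTS",
           "test": "all motions, TEST_SUBJECTS"},
}
```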
We first tested the baseline methods on the simplest setup, SSM (table 2). We noticed a difference between results obtained using background subtraction (BS) and bounding box (BB) inputs, with a slight edge for BB. This is not entirely surprising when considering, for instance, examples involving sitting poses. There, the presence of the chair makes background subtraction very difficult, and that affects the positioning of the object within the descriptor's coordinate system. This problem only affects our background subtraction data, since the bounding boxes are computed from the joint projections alone.
In our second setup we tested our models on each motion separately. These are referred to as Activity Specific Models (ASM). We noticed that errors are considerably higher, both because of the large size of our test set and because of the significant subject body variation introduced. Our ‘sitting down’ motion is one of the most challenging. It consists of subjects sitting on the floor in different poses. This scenario is complex to analyze because of the high rate of self-occlusion, as well as the bounding box aspect ratio changes. It also stretches the use of image descriptors extracted on regular grids, confirming that, while these may be reasonable for standing poses or pedestrians, they are not adequate for general human motion sensing. The other ‘sitting’ scenario in the dataset is challenging too, due to the use of external objects, in this case a chair. The ‘taking photo’ and ‘walking dog’ motions are also difficult because of bounding box variations, and because they are less repeatable, more liberty having been granted to the actors performing them. Overall, we feel that the dataset offers a good balance between somewhat ‘easier’ settings and moderately difficult or challenging ones, making it a plausible benchmark for testing new and improved features, models and algorithms.

Due to certain privacy concerns, we have decided to withhold the images of one of our testing subjects, S10. However, we make all the other data associated with this subject available, including silhouettes and bounding boxes, as well as the corresponding image descriptors. In this article we report results for both ASM and GM models, including S10 in the test set (tables 3 and 4). In the future, as other features are developed by external researchers or by us, we will strive to compute those and make them available for download for S10, too. Our evaluation server allows error evaluation on the test set both including and excluding S10.

Fig. 10. The proposed perceptual MPJLE error measure. In the left plot we show the MPJLE for the ‘Eating’ ASM. We compare all our predictors, where features are computed on input segments given by background subtraction. The right plot shows similar results for the background subtraction GM. Both plots show the average localization error as a function of the tolerance (0-200 mm). MPJPE errors for each model are given in parentheses in the plot legends (left: kNN 132.18, KRR 118.60, LinKRR 123.89, LinKDE 93.01; right: kNN 182.59, KRR 151.73, LinKRR 162.51, LinKDE 137.98). The results are computed using the full test set (including subject S10).

The MPJLE measure gives insight into the successes and failures of our tested methods (fig. 10). For the ‘Eating’ ASM, one of the easier activities, LinKDE correctly predicts, on average, 14 out of 17 joints at 150 mm tolerance. In contrast, KRR and LinKRR correctly predict only 12 joints at the same level of tolerance.
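A minimal sketch of this measure, under the reading that MPJLE at tolerance t is the fraction of joints whose predicted 3D position lies more than t millimeters from the ground truth, averaged over frames (function and array names are ours):

```python
import numpy as np

def mpjle(pred, gt, tolerance_mm):
    """pred, gt: arrays of shape (frames, joints, 3), in mm.
    Returns the average fraction of mislocalized joints; 1 - mpjle
    is the average fraction of correctly predicted joints (e.g.
    14/17 joints correct at 150 mm tolerance)."""
    dist = np.linalg.norm(pred - gt, axis=-1)  # (frames, joints)
    return float((dist > tolerance_mm).mean())
```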
Our final evaluation setup is the one where models are trained on all motions from all subjects. Due to the size of the dataset, this is a highly non-trivial process, and very few existing methodologies can handle it. We refer to this as the General Model (GM) setup and show the results in table 4. The models we have tested do not yet appear able to effectively leverage the structure in the data, but it is encouraging that linear Fourier approximations to non-linear methods can be applied to such large datasets with promising results, and within a reasonable time budget. Future research towards better image descriptors, improved modeling of correlations among the limbs of the human body, or the design of large scale learning methods should offer new insights into the structure of the data and should ultimately improve the 3D prediction accuracy.
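The structured predictor referenced in the tables, LinKDE, follows the kernel dependency estimation recipe: regress from input random features onto a random Fourier embedding of the output pose, then recover a pose with a pre-image step. The compressed sketch below uses a simple candidate-scoring decoder over training poses as the pre-image step; that decoding strategy and all names are illustrative, not our exact implementation.

```python
import numpy as np

def fit_linkde(Phi_x, Psi_y, lam=1e-4):
    """Ridge map from input random features Phi_x (n x D) onto the
    random Fourier embedding of the target poses Psi_y (n x E)."""
    D = Phi_x.shape[1]
    return np.linalg.solve(Phi_x.T @ Phi_x + lam * np.eye(D), Phi_x.T @ Psi_y)

def decode_pose(phi_x, A, cand_psi, cand_y):
    """Pre-image by scoring candidates: return the candidate pose whose
    output embedding is closest to the predicted embedding phi_x @ A."""
    pred = phi_x @ A                                      # (E,)
    d2 = (cand_psi ** 2).sum(axis=1) - 2.0 * cand_psi @ pred
    return cand_y[int(np.argmin(d2))]
```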

method mask S1 S7 S8 S9 S11 S5 S6


kNN BB 118.93 129.60 74.90 113.31 127.98 132.00 155.65
kNN BS 127.91 112.19 63.27 108.68 132.96 113.65 139.35
KRR BB 99.96 96.41 58.94 95.75 106.50 108.35 117.90
KRR BS 107.96 100.66 58.19 97.73 114.84 112.18 114.60
LinKRR BB 114.98 114.30 81.69 119.55 126.35 128.00 140.86
LinKRR BS 125.46 122.89 82.09 122.29 136.81 134.84 141.01
LinKDE BB 94.07 93.63 55.32 91.80 97.25 96.35 113.80
LinKDE BS 96.13 93.51 51.95 89.54 100.96 105.89 102.74

TABLE 2
Results of the different methods, corresponding to the subject specific modeling (SSM) setup, and for all training subjects in the
dataset. kNN indicates nearest neighbor (k=1), KRR is kernel ridge regression, and LinKRR represents a linear Fourier
approximation of KRR. LinKDE is the linear Fourier approximation corresponding to a structured predictor based on Kernel
Dependency Estimation (KDE). Errors are given in mm, using the MPJPE metric.
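For reference, the MPJPE metric used in tables 2-4 can be written per frame as below, where N = 17 is the number of evaluation joints and the hatted and unhatted symbols denote predicted and ground truth 3D joint positions (notation ours; the reported numbers average this quantity over all test frames, and any alignment conventions follow the dataset's evaluation code):

```latex
\mathrm{MPJPE}(f) \;=\; \frac{1}{N} \sum_{i=1}^{N}
    \bigl\| \hat{\mathbf{p}}_i(f) - \mathbf{p}_i(f) \bigr\|_2
```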
method mask Directions Discussion Eating Greeting Phone Talk Posing Buying Sitting
kNN BB 154.23 151.18 136.23 165.94 147.51 175.58 180.82 194.09
kNN BS 166.28 163.68 132.19 188.33 145.89 199.89 174.38 174.67
KRR BB 118.96 116.77 109.65 128.51 123.05 136.23 153.55 176.90
KRR BS 130.04 124.96 118.60 140.73 125.35 152.21 157.45 159.30
LinKRR BB 123.67 121.23 116.09 136.77 130.60 142.40 165.14 180.69
LinKRR BS 136.07 132.33 123.90 149.99 132.83 158.98 162.36 168.12
LinKDE BB 115.79 113.27 99.52 128.80 113.44 131.01 144.89 160.92
LinKDE BS 124.19 117.44 93.01 138.90 111.40 145.43 136.94 139.29
method mask Sitting Down Smoking Taking Photo Waiting Walking Walking Dog Walking Pair
kNN BB 209.06 161.22 234.05 176.16 167.00 239.38 180.91
kNN BS 237.05 169.41 247.57 193.78 158.27 216.53 189.80
KRR BB 184.58 120.19 182.50 139.66 129.13 183.27 143.19
KRR BS 213.72 130.47 197.21 150.39 119.28 175.34 150.43
LinKRR BB 204.62 128.62 194.32 144.54 133.49 191.92 147.87
LinKRR BS 231.57 139.88 208.14 157.92 126.90 185.64 156.33
LinKDE BB 172.98 114.00 183.09 138.95 131.15 180.56 146.14
LinKDE BS 203.10 118.37 197.13 146.30 115.28 166.10 153.59

TABLE 3
Comparison of predictors for the activity specific setting (ASM), on the test set (including S10). kNN indicates nearest neighbor
(k=1), KRR kernel ridge regression, LinKRR is a linear Fourier approximation of KRR, and LinKDE is the linear Fourier model for a
structured predictor based on Kernel Dependency Estimation (KDE). Errors are given in mm, using the MPJPE metric.
Joint Positions (MPJPE, mm):
         kNN      KRR      LKRR     LKDE
  BB     172.12   138.85   150.73   127.92
  BS     182.79   151.73   162.51   137.98

Joint Angles (MPJAE, degrees):
         kNN      KRR      LKRR     LKDE
  BB     18.28    13.83    13.86    13.68
  BS     17.75    13.83    13.92    13.74

TABLE 4
Results of our GM setup, with models estimated based on data from all subjects and activities in the training set, and evaluated on
the full test set, including S10. LinKRR (LKRR) and LinKDE (LKDE) are kernel models based on random Fourier approximations
trained and tested on 2.1M and 1.4M poses respectively. The exact KRR results are obtained by using a subset of only 40,000
human poses sampled from the training set. The results for joint positions are in mm using MPJPE and the results with angles are
in degrees computed using MPJAE.
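The angle-based counterpart, MPJAE, averages absolute joint angle differences wrapped to (-180, 180] degrees; a standard formulation is given below (notation ours; for the mixed reality evaluation in table 5 the global rotation term is excluded from the average):

```latex
\mathrm{MPJAE}(f) \;=\; \frac{1}{M} \sum_{i=1}^{M}
    \bigl| \operatorname{wrap}\bigl( \hat{\theta}_i(f) - \theta_i(f) \bigr) \bigr|,
\qquad
\operatorname{wrap}(\delta) \;=\; \bigl( (\delta + 180^{\circ}) \bmod 360^{\circ} \bigr) - 180^{\circ}
```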

Mixed Reality Results. We use models trained on our laboratory data and test on mixed reality data. The results are given in table 5. The test videos are named mixed-reality (MR) 1 to 7. We consider 2 scenarios: one using the ASM of the activity from which the test video was generated, and one using the GM. In this experiment we use MPJAE (in degrees) and, for technical reasons, ignore the error corresponding to the global rotation. The ASM results are in general better than the GM results, reflecting a more constrained prediction problem. As expected, BS results are better than BB results, showing that the benefit of the slightly more stable training features observed for BB in the laboratory setting is offset by the contribution of real background features.
Activity Specific Model (ASM) General Model (GM)


BB BS BB BS
kNN KRR LKRR LKDE kNN KRR LKRR LKDE kNN KRR LKRR LKDE kNN KRR LKRR LKDE
MR1 25.83 20.09 19.81 19.85 21.23 18.04 18.58 18.47 24.99 20.64 20.11 19.61 22.76 19.44 18.82 18.45
MR2 20.40 16.40 17.27 16.81 19.43 16.16 16.28 15.28 20.29 18.00 18.02 16.53 19.21 16.64 16.91 15.51
MR3 24.08 20.75 21.53 21.20 25.60 22.40 21.89 21.84 23.40 20.91 21.95 21.23 25.78 21.75 22.02 21.75
MR4 25.69 19.86 20.64 20.26 22.40 19.67 20.12 19.35 26.31 21.53 20.76 19.89 23.36 20.08 19.90 19.03
MR5 19.36 17.13 17.54 17.31 19.04 16.52 16.72 16.51 21.50 20.87 21.49 19.99 25.57 20.85 22.14 19.33
MR6 20.47 18.49 19.55 18.81 20.95 18.07 18.11 17.36 26.26 22.58 22.29 20.58 23.00 20.16 20.24 18.79
MR7 19.03 16.83 17.13 14.70 18.35 13.87 15.52 13.88 20.28 19.12 17.78 16.21 20.21 19.03 18.71 16.53

TABLE 5
Pose estimation error for our mixed reality dataset, obtained with moving cameras, and under challenging non-uniform
backgrounds and occlusion (see fig. 6). The errors are computed using MPJAE and do not include the global rotation. LinKRR
(here LKRR) and LinKDE (LKDE) are linear Fourier approximation methods. The models were trained on the data captured in the
laboratory and we tested on the mixed-reality sequences. For ASM, we used the model trained on motions of the same type as
the test motion. The results are promising but also show clear scope for feature design and model improvements (the methods
shown do not model or predict occlusion explicitly).

6 Conclusions

We have introduced a large scale dataset, Human3.6M, containing 3.6 million different 3D articulated poses captured from a set of professional men and women actors. Human3.6M complements the existing datasets with a variety of human poses typical of people seen in real-world environments, and provides synchronized 2D and 3D data (including time of flight, high quality image and motion capture data), accurate 3D human models (body surface scans) of the actors, and mixed reality settings for performance evaluation under realistic backgrounds, correct 3D scene geometry, and occlusion. We also provide studies and evaluation benchmarks based on discriminative pose prediction methods. Our analysis includes not only nearest neighbor and standard linear and non-linear regression methods, but also advanced structured predictors and large-scale approximations to non-linear models based on explicit Fourier feature maps. The ability to train complex approximations to non-linear models on millions of examples opens up possibilities to develop alternative feature descriptors and correlation kernels, and to test them seamlessly, at large scale. We show that our full dataset delivers important performance benefits compared to smaller equivalent datasets, but also that significant space for improvement exists. The data, the large-scale structured models, the image descriptors, and the visualization and software evaluation tools we have developed are freely available online, for academic use. We hope that Human3.6M and its associated tools will stimulate further research in computer vision and machine learning, and will help in the development of improved 3D human sensing systems that can operate robustly in the real world.

Acknowledgements: This work was supported in part by CNCS-UEFICSDI, under PNII RU-RC-2/2009, PCE-2011-3-0438, and CT-ERC-2012-1. We thank our colleague Sorin Cheran for support with the Web server. We are also grateful to Fuxin Li for helpful discussions and for experimental feedback on Fourier models.
Catalin Ionescu is a PhD candidate at the University of Bonn, Germany. As part of DFH-UFA, a Franco-German joint degree program, Catalin has received a Diplome d'ingenieur from the Institut National des Sciences Appliquées de Lyon and a Diplom für Informatik from the University of Karlsruhe. His main research interests are large scale machine learning and computer vision, with emphasis on 3D pose estimation.

Dragos Papava graduated with a Bachelor's degree in Computer Science from the Politehnica University of Bucharest and received a Master's degree in computer graphics, multimedia and virtual reality at the same university. His major interests are computer graphics, GPU computing and image processing.

Vlad Olaru holds a BS degree from the "Politehnica" University of Bucharest, Romania, an MS degree from Rutgers, The State University of New Jersey, USA, and a PhD from the Technical University of Karlsruhe, Germany. His research interests focus on distributed and parallel computing, operating systems, real-time embedded systems and high-performance computing for large-scale computer vision programs. His doctorate concentrated on developing kernel-level, single system image services for clusters of computers. He was a key person in several EU-funded as well as national projects targeting the development of real-time OS software to control the next generation of 3D intelligent sensors, real-time Java for multi-core architectures, and servers based on clusters of multi-core architectures.

Cristian Sminchisescu has obtained a doctorate in Computer Science and Applied Mathematics, with an emphasis on imaging, vision and robotics, at INRIA, France, under an Eiffel excellence doctoral fellowship, and has done postdoctoral research in the Artificial Intelligence Laboratory at the University of Toronto. He is a member of the program committees of the main conferences in computer vision and machine learning (CVPR, ICCV, ECCV, NIPS, AISTATS), an area chair for ICCV 2007-13, and an Associate Editor of IEEE PAMI. He has given more than 100 invited talks and presentations, and has offered tutorials on 3D tracking, recognition and optimization at ICCV and CVPR, the Chicago Machine Learning Summer School, the AERFAI Vision School in Barcelona, and the Computer Vision Summer School (VSS) in Zurich. His research interests are in the area of computer vision (3D human pose estimation, semantic segmentation) and machine learning (optimization and sampling algorithms, structured prediction, and kernel methods).
