An Overview of Human Pose Estimation With Deep Learning
An Overview of Human Pose Estimation With Deep Learning
Bharath Raj
Apr 28 · 9 min read
person. Each co-ordinate in the skeleton is known as a part (or a joint, or a keypoint). A
valid connection between two parts is known as a pair (or a limb). Note that, not all
part combinations give rise to valid pairs. A sample human pose skeleton is shown
below.
Left: COCO keypoint format for human pose skeletons. Right: Rendered human pose skeletons. (Source)
Knowing the orientation of a person opens avenues for several real-life applications,
some of which are discussed towards the end of this blog. Several approaches to
Human Pose Estimation were introduced over the years. The earliest (and slowest)
methods typically estimating the pose of a single person in an image which only had
one person to begin with. These methods often identify the individual parts first,
followed by forming connections between them to create the pose.
Naturally, these methods are not particularly useful in many real-life scenarios where
images contain multiple people.
https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b 2/11
21/08/2019 An Overview of Human Pose Estimation with Deep Learning
Another approach is to detect all parts in the image (i.e. parts of every person),
followed by associating/grouping parts belonging to distinct persons. This method
is known as the bottom-up approach.
Top: Typical Top-Down approach. Bottom: Typical Bottom-Up approach. (Image Source)
Typically, the top-down approach is easier to implement than the bottom-up approach
as adding a person detector is much simpler than adding associating/grouping
algorithms. It is hard to judge which approach has better performance overall as it
really comes down to which among the person detector and associating/grouping
algorithms is better.
In this blog, we will focus on multi-person human pose estimation using deep learning
techniques. In the next section, we will review some of the popular top-down and
bottom-up approaches for the same.
https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b 3/11
21/08/2019 An Overview of Human Pose Estimation with Deep Learning
1. OpenPose
OpenPose is one of the most popular bottom-up approaches for multi-person human
pose estimation, partly because of their well documented GitHub implementation.
The OpenPose network first extracts features from an image using the first few layers
(VGG-19 in the above flowchart). The features are then fed into two parallel branches
of convolutional layers. The first branch predicts a set of 18 confidence maps, with
each map representing a particular part of the human pose skeleton. The second
branch predicts a set of 38 Part Affinity Fields (PAFs) which represents the degree of
association between parts.
Successive stages are used to refine the predictions made by each branch. Using the
part confidence maps, bipartite graphs are formed between pairs of parts (as shown in
the above image). Using the PAF values, weaker links in the bipartite graphs are
https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b 4/11
21/08/2019 An Overview of Human Pose Estimation with Deep Learning
pruned. Through the above steps, human pose skeletons can be estimated and assigned
to every person in the image. For a more thorough explanation of the algorithm, you
may refer to their paper and to this blog post.
2. DeepCut
DeepCut is a bottom-up approach for multi-person human pose estimation. The
authors approached the task by defining the following problems:
1. Produce a set of D body part candidates. This set represents all possible locations
of body parts for every person in the image. Select a subset of body parts from the
above set of body part candidates.
2. Label each selected body part with one of C body part classes. The body part
classes represent the types of parts, such as “arm”, “leg”, “torso” etc.
The above problems were jointly solved by modeling it into an Integer Linear
Programming (ILP) problem. It is modeled by considering triples (x, y, z) of binary
random variables with domains as stated in the images below.
https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b 5/11
21/08/2019 An Overview of Human Pose Estimation with Deep Learning
Consider two body part candidates d and d' from the set of body part candidates D
and classes c and c' from the set of classes C . The body part candidates were
obtained through a Faster RCNN or a Dense CNN. Now, we can develop the following
set of statements.
Also, y(d,d') = 1 indicates that body part candidates d and d' belong to the
same person.
They also define z(d,d’,c,c’) = x(d,c) * x(d’,c’) * y(d,d’) . If the above value is
1, then it means that body part candidate d belongs to class c , body part
candidate d' belongs to class c' , and finally body part candidates d,d’ belong to
the same person.
The last statement can be used to partition pose belonging to different people. Clearly,
the above statements can be formulated in terms of linear equations as functions of
(x,y,z) . In this way, the Integer Linear Program (ILP) is set up, and the pose of
multiple persons can be estimated. For the exact set of equations and much more
detailed analysis, you can check out their paper here.
3. RMPE (AlphaPose)
RMPE is a popular top-down method of Pose Estimation. The authors posit that top-
down methods are usually dependent on the accuracy of the person detector, as pose
estimation is performed on the region where the person is located. Hence, errors in
localization and duplicate bounding box predictions can cause the pose extraction
algorithm to perform sub-optimally.
https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b 6/11
21/08/2019 An Overview of Human Pose Estimation with Deep Learning
E ect of duplicate predictions (left) and low con dence bounding boxes (right). (Source)
To resolve this issue, the authors proposed the usage of Symmetric Spatial Transformer
Network (SSTN) to extract a high-quality single person region from an inaccurate
bounding box. A Single Person Pose Estimator (SPPE) is used in this extracted region to
estimate the human pose skeleton for that person. A Spatial De-Transformer Network
(SDTN) is used to remap the estimated human pose back to the original image
coordinate system. Finally, a parametric pose Non-Maximum Suppression (NMS)
technique is used to handle the issue of redundant pose deductions.
4. Mask RCNN
Mask RCNN is a popular architecture for performing semantic and instance
segmentation. The model parallelly predicts both the bounding box locations of the
various objects in the image and a mask that semantically segments the object. The
basic architecture can be quite easily extended for human pose estimation.
https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b 7/11
21/08/2019 An Overview of Human Pose Estimation with Deep Learning
The basic architecture first extracts feature maps from an image using a CNN. These
feature maps are used by a Region Proposal Network (RPN) to get bounding box
candidates for the presence of objects. The bounding box candidates select an area
(region) from the feature map extracted by the CNN. Since the bounding box
candidates can be of various sizes, a layer called RoIAlign is used to reduce the size of
the extracted feature such that they are all of the uniform size. Now, this extracted
feature is passed into the parallel branches of CNNs for final prediction of the bounding
boxes and the segmentation masks.
Let us focus on the branch that performs segmentation. Suppose an object in our image
can belong to one among K classes. The segmentation branch outputs K binary masks
of size m x m , where each binary mask represents all objects belonging to that class
alone. We can extract keypoints belonging to every person in the image by modeling
each type of keypoint as a distinct class and treating this like a segmentation problem.
Parallely, the objection detection algorithm can be trained to identify the location of
the persons. By combining the information of the location of the person as well as their
set of keypoints, we obtain the human pose skeleton for every person in the image.
This method nearly resembles the top-down approach, but the person detection stage is
performed in parallel to the part detection stage. In other words, the keypoint detection
stage and person detection stage are independent of each other.
5. Other Methods
Multi-Person Human Pose Estimation is a vast field with a plethora of approaches to
tackle the problem. For brevity, only a select few approaches are explained here. For a
more exhaustive list of approaches, you may check out the following links:
Applications
Pose Estimation has applications in myriad fields, some of which are listed below.
1. Activity Recognition
Tracking the variations in the pose of a person over a period of time can also be used
for activity, gesture and gait recognition. There are several use cases for the same,
https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b 8/11
21/08/2019 An Overview of Human Pose Estimation with Deep Learning
including:
Applications that can autonomously teach proper work out regimes, sport
techniques and dance activities.
Applications that can understand full-body sign language. (Ex: Airport runway
signals, traffic policemen signals, etc.).
Tracking the gait of the person is useful for security and surveillance purposes. (Image source)
https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b 9/11
21/08/2019 An Overview of Human Pose Estimation with Deep Learning
A good visual example of what is possible can be seen through Animoji. Even though
the above only tracks the structure of a face, the idea can be extrapolated for the
keypoints of a person. The same concepts can be leveraged to render Augmented
Reality (AR) elements that can mimic the movements of a person.
3. Training Robots
Instead of manually programming robots to follow trajectories, robots can be made to
follow the trajectories of a human pose skeleton that is performing an action. A human
instructor can effectively teach the robot certain actions by just demonstrating the
same. The robot can then calculate how to move its articulators to perform the same
action.
https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b 10/11
21/08/2019 An Overview of Human Pose Estimation with Deep Learning
Conclusion
Great strides have been made in the field of human pose estimation, which enables us
to better serve the myriad applications that are possible with it. Moreover, research in
related fields such as Pose Tracking can greatly enhance its productive utilization in
several fields. The concepts listed in this blog are not exhaustive but rather strives to
introduce some popular variants of these algorithms and their real-life applications.
Machine Learning Arti cial Intelligence Deep Learning Data Science Pose Estimation
https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b 11/11