An Evaluation of 2D Human Pose Estimation Based on Residual Networks
ABSTRACT: 2D Human Pose Estimation (2D-HPE) has been widely applied in practical applications such as sports analysis, medical fall detection, and human-robot interaction, and Convolutional Neural Network (CNN) approaches have achieved many good results. In particular, 2D-HPE results serve as an intermediate step in the 3D Human Pose Estimation (3D-HPE) process. In this paper, we perform a study to compare the results of 2D-HPE using versions of the Residual Network (ResNet/RN) (RN-10, RN-18, RN-50, RN-101, RN-152) on the Human 3.6M Dataset (HU-3.6M-D). We transformed the original 3D annotation data of the Human 3.6M dataset to 2D human poses. The estimation models are fine-tuned on two protocols of the HU-3.6M-D with the same input parameters across the RN versions. The best estimate has an error of 34.96 pixels on Protocol #1 and 28.48 pixels on Protocol #3 when training with 10 epochs; increasing the number of training epochs reduces the estimation error (15.8 pixels on Protocol #1, 12.4 pixels on Protocol #3). Quantitative evaluation, comparison, analysis, and illustrative results are presented in the paper.
KEYWORDS: 2D Human Pose Estimation, Residual Network backbone, Human 3.6M Dataset, Convolutional Neural Networks
1. Introduction

Human pose estimation is defined as the process of localizing human joints (also known as keypoints: elbows, wrists, etc.) in 2D or 3D space. Estimating human pose from captured images/videos has two research directions: 2D-HPE and 3D-HPE. If the output is a human pose on images or videos, the problem is called 2D-HPE; if the output is a human pose in 3D space, it is called 3D-HPE. Consequently, there has been a great deal of research on this problem in the last five years. The results of human pose estimation are applied in many fields, such as sports analysis [1, 2]; medical fall event detection [3]; identification and analysis in traditional martial arts [4]; and robot interaction and the construction of character actions and movements in games [1]. The 2D-HPE is an intermediate result for the 3D-HPE, and the 3D-HPE result is highly dependent on the 2D-HPE result when following the approach of Zhou et al. [5]. To build a complete system, it is necessary to evaluate and compare the results at each step, as in the studies of Chandrasekaran et al. [6]-[7] for building a System on Chip, where the authors scheduled tests of the algorithms on the chip.

Currently, many studies on 3D-HPE use 2D-HPE results on color images as an intermediary to estimate the 3D human pose [8]-[9]. These studies are often grouped into the "2D to 3D Lifting Approaches" [10].

Estimating 2D human pose based on deep learning follows two methods. The first is the regression methods, which apply a deep network that learns from the ground-truth joints on the input images to regress body joints or the parameters of human body models/skeletons and thereby predict the key points on the human. The second method predicts the approximate locations of body parts. Deep learning networks have achieved remarkable results for this estimation task, in which skeletal keypoints are regressed based on ground-truth heatmaps generated from the 2D keypoints by 2D Gaussian kernels [11]-[12]. In particular, 2D keypoint estimation from heatmaps is demonstrated by the stacked hourglass networks [13] as the state of the art. However, the task still faces many challenges, such as heavy occlusion and partially visible human bodies. RN [14] is one of the backbones with the best feature-extraction results on the ImageNet dataset; it is used in many CNNs to detect, segment, and recognize objects and to estimate pose (as presented in Figure 1 [15]).
In this paper, we experiment to compare 2D human pose estimation based on CNNs that follow the regression methods. We use different versions of RN for 2D-HPE; the training model is based on RN-10, RN-18, RN-50, RN-101, and RN-152. The results of 2D human pose prediction are evaluated on the benchmark HU-3.6M-D, a widely used and challenging dataset in which body parts of the human are often obscured. To obtain the 2D human pose annotation data of the HU-3.6M-D for 2D-HPE, we perform an inverse transformation from the 3D pose annotation of the human in the Real-World Coordinate System (R-WCS) of the MoCap system to a 2D pose annotation in image coordinates, based on the set of intrinsic parameters provided for calibrating the image data. The results are presented in the following parts of the paper.

In this paper, we have some contributions as follows:
• We have fine-tuned different versions of the RN with input data of size (224 × 224) to estimate the 2D human pose in RGB images.

• We have fine-tuned the estimation model on the HU-3.6M-D, with the 2D pose ground truth determined from the 3D pose annotation data and the intrinsic parameters of the camera.

• We evaluate the estimated results based on the absolute coordinates of the estimated data against the original data. From there, we choose the best RN version with input data of 224 × 224 for 2D-HPE on RGB images, which should in turn give good results in 3D-HPE.
The paper is organized as follows. Section 1 introduces several backbones for detecting and estimating people in images. Section 2 presents the related studies on 2D-HPE methods. Section 3 presents the main idea and the versions of RN. Section 4 shows and discusses the experimental results of 2D human pose estimation, and Section 5 concludes the paper and outlines future work.
2. Related Works

RN [14] is a backbone applied in many CNNs for feature extraction and object prediction in the first step, such as Fast R-CNN [16], Faster R-CNN [17], and Mask R-CNN [18]. Figure 1 shows the RN as the backbone in the Mask R-CNN network architecture. RN [14] is more efficient than other backbones like AlexNet [19] and VGG [20], [21].

Figure 1: The human instance segmentation model on the image based on the Mask R-CNN architecture. Mask R-CNN is generated based on the combination of Faster R-CNN [17] and an FCN.
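To make the use of these backbones concrete, the following is a minimal sketch of instantiating RN feature extractors of different depths. We assume a PyTorch/torchvision setup purely for illustration (the cited works do not prescribe a framework), and note that RN-10 is not shipped with torchvision and would require a custom definition.

```python
import torch
import torchvision.models as models

# Standard RN depths available in torchvision; RN-10 would be custom-built.
backbones = {
    "RN-18": models.resnet18(weights=None),
    "RN-50": models.resnet50(weights=None),
    "RN-101": models.resnet101(weights=None),
    "RN-152": models.resnet152(weights=None),
}

x = torch.randn(1, 3, 224, 224)  # a (224 x 224) RGB input, as used in this paper
for name, net in backbones.items():
    # Drop the average-pooling and classification layers to obtain
    # a pure feature extractor for detection/pose heads.
    features = torch.nn.Sequential(*list(net.children())[:-2])
    print(name, tuple(features(x).shape))  # e.g. RN-50 -> (1, 2048, 7, 7)
```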
2D-HPE from RGB image data using CNNs can be done by two methods [10]: regression methods and body-part detection methods.

The regression methods use CNN models that learn from the ground-truth joints on the input images to regress body joints or the parameters of human body models/skeletons and thereby predict the key points on the human. Toshev et al. [22] proposed a Deep Neural Network (DNN) based on the cascade technique for regressing the locations of body joints. The proposed CNN includes seven layers, and the input image is resized to 220 × 220 pixels. The cascade-of-pose-regressors technique is applied to train the multi-stage prediction model: the first stage of the cascade starts with an initial pose predicted over the entire input image, and in the next stages, DNN regressors are trained to predict the displacement of the predicted joint locations from the correct locations of the previous stage. Thus, the currently predicted pose is refined at each subsequent stage. Liang et al. [23] proposed a strategy of compositional pose regression based on RN-50 [14]. The bones are parameterized in a bone-based representation that contains human skeleton information and structure, instead of a joint-based representation. The loss function is calculated on each part of the human body; the joints are defined relative to a constant origin point J0 in the image coordinate system, and each bone is a directed vector pointing from a joint to its parent. Luvizon et al. [24] proposed a regression method that uses two Soft-argmax blocks (Block-A and Block-B) for 2D human pose estimation from images: Block-A provides refined features, Block-B provides skeleton-part and context maps, and the two blocks are combined to build one prediction block. Block-A uses residual separable convolutions, and the input feature maps are transformed into part-based detection maps and context maps by Block-B.
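The soft-argmax at the core of this method converts a heatmap into continuous 2D coordinates while staying differentiable. The following is our minimal sketch of the generic operation, not the exact Block-A/Block-B design of Luvizon et al. [24]:

```python
import torch

def soft_argmax_2d(heatmaps: torch.Tensor) -> torch.Tensor:
    """Differentiable soft-argmax over heatmaps of shape (B, J, H, W).
    Returns (B, J, 2) expected (x, y) coordinates in pixel units."""
    b, j, h, w = heatmaps.shape
    # Softmax turns each map into a probability distribution over pixels.
    probs = torch.softmax(heatmaps.reshape(b, j, -1), dim=-1).reshape(b, j, h, w)
    xs = torch.arange(w, dtype=probs.dtype).view(1, 1, 1, w)
    ys = torch.arange(h, dtype=probs.dtype).view(1, 1, h, 1)
    # Expected coordinate = probability-weighted sum over the pixel grid.
    x = (probs * xs).sum(dim=(2, 3))
    y = (probs * ys).sum(dim=(2, 3))
    return torch.stack([x, y], dim=-1)
```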
As for the body-part detection methods, a body-part detector is trained to predict the locations of human joints. Newell et al. [13] proposed the stacked hourglass architecture to train a model that predicts the positions of body joints on heatmaps, in which the 2D annotation is used to generate the ground-truth heatmaps by the 2D Gaussian method. The stacked hourglass repeats bottom-up and top-down processing with intermediate supervision across eight hourglass modules. This CNN applies convolutional and max-pooling layers down to a very low resolution, followed by a top-down sequence of upsampling (nearest-neighbor upsampling of the lower resolution) and a combination of features across scales. The 2D-HPE result on the MPII dataset under the [email protected] measurement is 90.9%, and the results on the FLIC dataset under the [email protected] measurement are 99.0% and 97.0%.
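The ground-truth heatmap generation used by such detectors can be sketched as follows; the value of sigma and the normalization are assumptions that vary between implementations:

```python
import numpy as np

def gaussian_heatmap(joint_xy, h, w, sigma=2.0):
    """Ground-truth map for one joint: a 2D Gaussian kernel centered
    on the annotated 2D keypoint, as in heatmap-based methods [11]-[13]."""
    x0, y0 = joint_xy
    xs = np.arange(w)           # shape (w,)
    ys = np.arange(h)[:, None]  # shape (h, 1); broadcasts to (h, w)
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))

# One map per joint: a 17-joint pose yields a (17, H, W) training target.
target = np.stack([gaussian_heatmap(p, 64, 64) for p in [(32, 40), (20, 12)]])
```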
Cao et al. [25] proposed a two-branch CNN model in which body-part detection is predicted from heatmaps, using the 2D keypoint annotations to generate the ground-truth confidence maps. The confidence maps are predicted by the first branch and the part affinity fields by the second branch. The part affinity fields are a novel feature representation that encodes both location and orientation information across each limb's active region.

Most of the above studies were evaluated on the COCO [26] and MPII [27] datasets using the Percentage of Correct Keypoints (PCK, %) measure. This measure is usually based on the estimated joint error relative to a reference length (such as the root joint segment), without taking into account the absolute estimates of the 2D keypoints (estimated absolute coordinates versus ground-truth coordinates).
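For reference, the PCK measure can be computed as in the sketch below (the standard definition: the reference length is the head-segment size for [email protected] on MPII and the torso size for [email protected] on FLIC; dataset tooling differs in detail):

```python
import numpy as np

def pck(pred, gt, ref_lengths, alpha=0.5):
    """Percentage of Correct Keypoints.
    pred, gt: (N, J, 2) 2D joints; ref_lengths: (N,) per-image reference
    lengths. A joint counts as correct if its Euclidean error is below
    alpha times the reference length."""
    err = np.linalg.norm(pred - gt, axis=-1)       # (N, J) pixel errors
    correct = err <= alpha * ref_lengths[:, None]  # per-image threshold
    return 100.0 * correct.mean()
```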
3. 2D-HPE Based on The RN and Its Variations
2D-HPE is an intermediate result for estimating the 3D human pose according to the 2D-to-3D lifting methods and the model-based methods [10]. Therefore, the 2D-HPE results have a great influence on the 3D-HPE results. The RN is applied in many studies on human pose estimation and gives good results [28]-[29]. That is the motivation for us to carry out this study and select a model with good 2D pose estimation results. We compared results from different versions of the RN (RN-10, RN-18, RN-34, RN-50, RN-101, RN-152) [14] to select the best one.
The Residual Network (ResNet/RN) was introduced in 2015 and took 1st place in the 2015 ILSVRC challenge with an error rate of only 3.57%. Currently, there are many variations of the RN architecture with different numbers of layers. The name RN is followed by a number indicating the architecture with a certain number of layers: RN-10 (10 Conv layers), RN-18 (18 Conv layers), RN-34 (34 Conv layers), RN-50 (50 Conv layers), RN-101 (101 Conv layers), and RN-152 (152 Conv layers), as shown in Fig. 3.
RN is a DNN designed to work with hundreds or thousands of convolutional layers. When building a CNN with many convolutional layers, the Vanishing Gradient phenomenon occurs, leading to bad model training results. The phenomenon arises as follows. Training a DNN usually uses the Backpropagation algorithm [30]: the output of the current layer is the input of the next layer, the gradient of the cost function is computed with respect to each parameter (weight) of the network, and Gradient Descent is then used to update those parameters. This process is repeated until the parameters of the network converge. Normally we set a hyper-parameter (the number of epochs, i.e., the number of times the training set is traversed and the weights are updated) that defines how many iterations of this process to perform. If the number of iterations is too small, the network may not give good results; conversely, the training time grows if the number of iterations is too large. In practice, however, gradients often have smaller values at the lower layers. As a result, the updates performed by Gradient Descent barely change the weights of those layers, those layers do not converge, and the network does not work well. This phenomenon is called "Vanishing Gradients". RN proposed the "identity shortcut connection", which skips over one or more layers, illustrated in Fig. 4.

Figure 4: A Residual Block across two layers of RN.

RN, like other CNNs, includes convolution, pooling, activation, and fully-connected layers. In RN, there is additionally a curved arrow starting at the beginning and ending at the end of the residual block, as in Fig. 4. In other words, the input x is added to the output of the stacked layers, which counteracts a vanishing derivative, since x is still passed through. With H(x) denoting the desired output, the stacked layers learn the residual mapping F(x), and the block outputs F(x) + x, which should be equal or approximately equal to H(x). When the input of the block has the same dimensions as its output, RN uses the identity block; otherwise, it uses the convolutional block, as presented in Fig. 5.
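A minimal sketch of such a residual block, covering both the identity shortcut and the convolutional (projection) shortcut, is given below; it is our simplified version of the basic block of [14], not the exact graph of any RN variant used in our experiments:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two-layer residual block: output = ReLU(F(x) + shortcut(x))."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.f = nn.Sequential(  # the residual mapping F(x)
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        if stride != 1 or in_ch != out_ch:
            # "Convolutional block": a 1x1 projection matches the shapes.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            # "Identity block": input and output shapes already match.
            self.shortcut = nn.Identity()

    def forward(self, x):
        # The skip connection keeps gradients flowing even if F(x) vanishes.
        return torch.relu(self.f(x) + self.shortcut(x))
```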
4. Experimental Results
4.1. Dataset
To fine-tune, generate, and evaluate the estimation models, we use the benchmark HU-3.6M-D [32]. HU-3.6M-D is an indoor dataset for the evaluation of 3D-HPE from single or multiple camera views (the data is collected in a lab environment from 4 different perspectives). This dataset is captured from 11 subjects/people (6 males and 5 females). The people perform six types of action (upper-body direction movements, full-body upright variations, walking variations, variations while seated on a chair, sitting on the floor, and various movements), which comprise 16 daily activities (directions, discussion, greeting, posing, purchases, taking photo, waiting, walking, walking dog, walking pair, eating, phone talk, sitting, smoking, sitting down, miscellaneous). The frames are captured by TOF (Time-of-Flight) cameras at frame rates from 25 to 50 Hz. This dataset contains about 3.6 million images (1,464,216 frames for training - 5 people (2 female and 3 male); 646,180 frames for validation - 2 people (1 female and 1 male); 1,467,684 frames for testing - 4 people (2 female and 2 male)) and 3.6 million 3D human pose annotations captured by the marker-based MoCap system. The 3D human pose annotation of HU-3.6M-D consists of 17 key points arranged in the order shown in Fig. 7.
The 3D human pose annotations of HU-3.6M-D are produced with the MoCap system, and their coordinate system is the R-WCS. To evaluate the estimation results, we convert this data to the Camera Coordinate System (CCS). We base this on the parameter set of the cameras and the formula of Nicolas [33] for converting data from 2D to 3D, Eq. 1:

P3Dc.x = (xd − cx) · depth(xd, yd) / fx
P3Dc.y = (yd − cy) · depth(xd, yd) / fy        (1)
P3Dc.z = depth(xd, yd)

where fx, fy, cx, and cy are the intrinsics of the depth camera, and P3Dc is the coordinate of the keypoint in the CCS.
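Eq. 1 corresponds to the following back-projection sketch (the function and variable names are ours, for illustration only):

```python
import numpy as np

def backproject(xd, yd, depth, fx, fy, cx, cy):
    """Eq. 1: lift a depth-image pixel (xd, yd) to a 3D point in the CCS.
    depth is the depth image indexed as depth[row, col]; fx, fy, cx, cy
    are the intrinsics of the depth camera."""
    z = depth[yd, xd]
    x = (xd - cx) * z / fx
    y = (yd - cy) * z / fy
    return np.array([x, y, z])
```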
Before evaluating the results of the 2D pose estimation, we re-projected the 3D human pose annotation from the R-WCS to the CCS using Eq. 2:

P3Dc = (P3Dw − T) · R⁻¹        (2)

where R and T are the rotation and translation parameters that transform from the R-WCS to the CCS, and P3Dw is the coordinate of the keypoint in the R-WCS. We then projected to the 2D human pose annotation using Eq. 3:

P2D.x = P3Dc.x · fx / P3Dc.z + cx
P2D.y = P3Dc.y · fy / P3Dc.z + cy        (3)

where P2D is the coordinate of the keypoint in the image.
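Eqs. 2 and 3 together form the pipeline used to derive the 2D ground truth from the MoCap annotation; a minimal sketch is given below. The matrix convention (row versus column vectors, R versus R⁻¹) is an assumption here and must match the Human3.6M camera calibration files:

```python
import numpy as np

def world_to_image(p3d_w, R, T, fx, fy, cx, cy):
    """Map a MoCap keypoint from the R-WCS to image coordinates."""
    # Eq. 2: world -> camera coordinate system (assumed convention).
    p3d_c = np.asarray(R) @ (np.asarray(p3d_w) - np.asarray(T))
    # Eq. 3: pinhole projection, camera coordinates -> pixel coordinates.
    u = p3d_c[0] * fx / p3d_c[2] + cx
    v = p3d_c[1] * fy / p3d_c[2] + cy
    return np.array([u, v])
```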
The source code and the HU-3.6M-D with 2D annotation data are available at link 1.

The authors have divided the HU-3.6M-D into 3 protocols to train and test the estimation models. Protocol #1 includes Subject #1, Subject #5, Subject #6, and Subject #7 for the training model, and Subject #9 and Subject #11 for the testing model. Protocol #2 is divided similarly to Protocol #1; however, the predictions are further post-processed by a rigid transformation before being compared to the ground truth. Protocol #3 includes Subject #1, Subject #5, Subject #6, Subject #7, and Subject #9 for the training model, and Subject #11 for the testing model. This dataset is saved at path 2.
1 https://drive.google.com/drive/folders/1s3VmcZL8M2EK2M_Ese1-EWBVEZ5brM7Q?usp=sharing
2 http://vision.imar.ro/human3.6m/
4.2. Implementation

The input data of our network includes color/RGB image data and the 2D human pose annotation. All images are resized to (224 × 224) before being fed to the network.

In this paper, the loss function for training the estimation model includes two parts, L1 and L2. We used the loss function L1 and the Adam optimizer for the training process. First, we initialized the loss function L1 for the 2D coordinates predicted from the RN. Then, we computed the total loss function as in Eq. 4:

L = α · L1 + β · L2        (4)

The estimation error is the average Euclidean (L2) distance in pixels between the annotated and estimated 2D keypoints, Eq. 5:

Err_avg = (1 / N1) Σ_{i=1..N1} (1 / J) Σ_{j=1..J} L2(p_{i,j}, p̂_{i,j})        (5)

where N1 is the number of evaluated frames, J is the number of joints per frame, p_{i,j} is the annotated 2D keypoint, and p̂_{i,j} is the estimated 2D keypoint.
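The objective of Eq. 4 and the metric of Eq. 5 can be sketched as follows; reading L1/L2 as the mean absolute and mean squared errors on the predicted 2D coordinates, and the values of alpha and beta, are our assumptions (they are not fixed by the equations above):

```python
import torch
import torch.nn.functional as F

alpha, beta = 1.0, 1.0  # placeholder weights for Eq. 4

def total_loss(pred, gt):
    """Eq. 4 (sketch): weighted sum of an L1 and an L2 term on (N, J, 2)
    predicted vs. ground-truth 2D joint coordinates."""
    return alpha * F.l1_loss(pred, gt) + beta * F.mse_loss(pred, gt)

def err_avg(pred, gt):
    """Eq. 5: mean Euclidean pixel distance over N1 frames and J joints."""
    return torch.linalg.norm(pred - gt, dim=-1).mean()
```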
Figure 9: The average error between the 2D keypoint annotation and the estimated 2D keypoint of Protocol #3 on the validation set.
Table 1: The average error (Err_avg, in pixels) between the 2D keypoint annotation and the estimated 2D keypoint of Protocol #3 of the HU-3.6M-D.

CNN                               RN-10    RN-18    RN-50     RN-101    RN-152
Average Error (Err_avg) (Pixels)  28.48    29.35    578.99    602.44    593.48
Figure 10: Illustrating a 2D-HPE result on the image of Protocol #1 on Subject #9, Subject #11.
Figure 11: Illustrating a 2D-HPE result on the image of Protocol #1 on Subject #9 of RN-10.
The RN-10 network is a smaller CNN than the other networks, which suggests that a smaller number of convolutional layers makes the network converge faster. This is also consistent with the explanation that smaller networks can learn more efficiently than large CNNs [34].

Figure 11 shows the result sequences of 2D-HPE of RN-10 on Protocol #1 of Subject #9. The estimated 2D keypoints are the blue-green nodes, and the links between the estimated 2D keypoints are the red lines.
5. Conclusions and Future Works

In this paper, we have performed a comparative study of 2D-HPE based on versions of RN (RN-10, RN-18, RN-50, RN-101, RN-152) on the HU-3.6M-D with two evaluation protocols (Protocol #1, Protocol #3). We have transformed the 3D human pose annotation data into 2D human pose annotations. The average error of RN-10 is 34.96 pixels on Protocol #1 and 28.48 pixels on Protocol #3, which are the best results. The results are evaluated and shown in detail and visually on the images. Therefore, RN-10 is a good CNN for estimating the 2D human pose on images, and this result can be used to estimate the 3D human pose. In the future, we will use the human pose estimation results of RN-10 for 3D-HPE and compare them with the studies of Zheng et al. [35] and Li et al. [9], which currently have the best results in 3D-HPE.
Conflict of Interest: The paper is our own research, not related to any organization or individual. It is part of a series of studies on 2D and 3D human pose estimation.

Acknowledgement: This research is funded by Tan Trao University in Tuyen Quang, Viet Nam.
References

[1] N. S. Willett, H. V. Shin, Z. Jin, W. Li, A. Finkelstein, "Pose2Pose: Pose Selection and Transfer for 2D Character Animation", International Conference on Intelligent User Interfaces (IUI), pp. 88–99, 2020, doi:10.1145/3377325.3377505.

[2] H. Zhang, C. Sciutto, M. Agrawala, K. Fatahalian, "Vid2Player: Controllable Video Sprites That Behave and Appear Like Professional Tennis Players", ACM Transactions on Graphics, vol. 40, no. 3, pp. 1–16, 2021, doi:10.1145/3448978.

[8] J. Martinez, R. Hossain, J. Romero, J. J. Little, "A Simple Yet Effective Baseline for 3D Human Pose Estimation", Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2659–2668, 2017.

[9] S. Li, L. Ke, K. Pratama, Y.-W. Tai, C.-K. Tang, K.-T. Cheng, "Cascaded Deep Monocular 3D Human Pose Estimation with Evolutionary Training Data", IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[10] Q. Dang, J. Yin, B. Wang, W. Zheng, "Deep Learning Based 2D Human Pose Estimation: A Survey", Tsinghua Science and Technology, vol. 24, no. 6, pp. 663–676, 2019, doi:10.26599/TST.2018.9010100.

[11] Z. Luo, Z. Wang, Y. Huang, L. Wang, T. Tan, E. Zhou, "Rethinking the Heatmap Regression for Bottom-up Human Pose Estimation", CVPR, pp. 13259–13268, 2021, doi:10.1109/cvpr46437.2021.01306.

[12] A. Bulat, G. Tzimiropoulos, "Human Pose Estimation via Convolutional Part Heatmap Regression", European Conference on Computer Vision (ECCV), vol. 9911 LNCS, pp. 717–732, 2016, doi:10.1007/978-3-319-46478-7_44.

[13] A. Newell, K. Yang, J. Deng, "Stacked Hourglass Networks for Human Pose Estimation", ECCV, 2016.

[14] K. He, X. Zhang, S. Ren, J. Sun, "Deep Residual Learning for Image Recognition", IEEE Conference on CVPR, pp. 770–778, 2016, doi:10.1109/CVPR.2016.90.

[15] R. Zhang, L. Du, Q. Xiao, J. Liu, "Comparison of Backbones for Semantic Segmentation Network", Journal of Physics: Conference Series, vol. 1544, 2020, doi:10.1088/1742-6596/1544/1/012196.

[16] R. Girshick, "Fast R-CNN", Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, 2015, doi:10.1109/ICCV.2015.169.

[17] S. Ren, K. He, R. Girshick, J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017, doi:10.1109/TPAMI.2016.2577031.

[18] K. He, G. Gkioxari, P. Dollar, R. Girshick, "Mask R-CNN", ICCV, 2017.

[19] A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", in F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger, eds., Advances in Neural Information Processing Systems, vol. 25, Curran Associates, Inc., 2012.

[20] M. Lin, Q. Chen, S. Yan, "Network in Network", 2nd International Conference on Learning Representations (ICLR), pp. 1–10, 2014.

[21] K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.

[27] M. Andriluka, L. Pishchulin, P. Gehler, B. Schiele, "2D Human Pose Estimation: New Benchmark and State of the Art Analysis", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[28] X. Xiao, W. Wan, "Human Pose Estimation via Improved ResNet50", 4th International Conference on Smart and Sustainable City (ICSSC 2017), vol. 148, pp. 148–162, 2017.

[29] Y. Wang, T. Wang, "Cycle Fusion Network for Multi-Person Pose Estimation", Journal of Physics: Conference Series, vol. 1550, no. 3, 2020.

[30] N. Benvenuto, F. Piazza, "On the Complex Backpropagation Algorithm", IEEE Transactions on Signal Processing, vol. 40, no. 4, pp. 967–969, 1992, doi:10.1109/78.127967.

[31] N. V. Hieu, N. L. H. Hien, "Recognition of Plant Species Using Deep Convolutional Feature Extraction", International Journal on Emerging Technologies, vol. 11, no. 3, pp. 904–910, 2020.

[32] C. Ionescu, D. Papava, V. Olaru, C. Sminchisescu, "Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325–1339, 2014.

Currently, he is a lecturer at Tan Trao University. His research interests include computer vision, RANSAC and its variants, 3D object detection and recognition, machine learning, and deep learning.

Trung-Minh Bui received his Bachelor degree from Thai Nguyen University of Information and Communication Technology (ICTU) in 2010 and his M.Sc. degree from the same university in 2014. Currently, he is a lecturer at Tan Trao University. His research interests include computer science, machine learning, and deep learning.