
Special Issue on Multidisciplinary Sciences and Advanced Technology

Received: 08 January 2022, Accepted: 26 February 2022, Online: 17 March 2022


DOI: https://dx.doi.org/10.55708/js0103007

An Evaluation of 2D Human Pose Estimation based on ResNet Backbone

Hai-Yen Tran1, Trung-Minh Bui2, Thi-Loan Pham3, Van-Hung Le∗,2

1 Tan Trao University, Tuyen Quang, 22000, Vietnam
2 Vietnam Academy of Dance, Hanoi, 100000, Vietnam
3 Hai Duong College, Hai Duong, 02203, Vietnam
∗ Corresponding author: Van-Hung Le, Tuyen Quang province, Email: [email protected]
Corresponding author ORCID: https://orcid.org/0000-0003-4302-0581

ABSTRACT: 2D Human Pose Estimation (2D-HPE) has been widely applied in many practical applications in life, such as sports analysis, medical fall detection, and human-robot interaction, using Convolutional Neural Networks (CNNs), which have achieved many good results. In particular, the 2D-HPE results are an intermediate step in the 3D Human Pose Estimation (3D-HPE) process. In this paper, we perform a study to compare the results of 2D-HPE using versions of the Residual Network (ResNet/RN) (RN-10, RN-18, RN-50, RN-101, RN-152) on the Human 3.6M Dataset (HU-3.6M-D). We transformed the original 3D annotation data of the Human 3.6M dataset to a 2D human pose. The estimation models are fine-tuned based on two protocols of the HU-3.6M-D with the same input parameters in all RN versions. The best estimate has an error of 34.96 pixels with Protocol #1 and 28.48 pixels with Protocol #3 when training with 10 epochs; increasing the number of training epochs reduces the estimation error (15.8 pixels on Protocol #1, 12.4 pixels on Protocol #3). Quantitative evaluation, comparison, analysis, and illustration of the results are presented in the paper.

KEYWORDS: 2D Human Pose Estimation, Residual Networks backbone, Human 3.6M Dataset, Convolutional Neural Networks

1. Introduction

Human pose estimation is defined as the process of localizing the joints of humans (also known as keypoints - elbows, wrists, etc.) in 2D or 3D space. Estimating the human pose from captured images/video has two research directions: 2D-HPE and 3D-HPE. If the output is a human pose on images or videos, the problem is called 2D-HPE; if the output is a human pose in 3D space, it is called 3D-HPE. Consequently, there has been a lot of research on this issue in the last 5 years. The results of human pose estimation are applied in many fields, such as sports analysis [1, 2]; medical fall event detection [3]; identification and analysis in traditional martial arts [4]; robot interaction, and construction of actions and movements of people in games [1]. The 2D-HPE is an intermediate result for the 3D-HPE. The 3D-HPE result is highly dependent on the 2D-HPE result when based on the approach of Zhou et al. [5]. To build a complete system, it is necessary to evaluate and compare the results at each step, as in the studies of Chandrasekaran et al. [6]-[7] for building a System on Chip, where the authors made a schedule for testing the algorithms on chip.

Currently, many studies on 3D-HPE use 2D-HPE results on color images as an intermediary to estimate the 3D human pose [8]-[9]. These studies are often grouped into the "2D to 3D Lifting Approaches" [10].

Estimating 2D human pose based on deep learning has two methods. The first is the regression methods, which apply a deep network to learn joint locations, from the input ground-truth joints on the images, to body joints or parameters of human body models/human skeletons to predict the keypoints on the human. The second method predicts the approximate locations of body parts. Deep learning networks have achieved remarkable results for the estimation task, in which all skeletal keypoints are regressed based on ground-truth heatmaps (2D keypoints) generated by 2D Gaussian kernels [11]-[12]. In particular, 2D keypoint estimation from the heatmap is shown in stacked hourglass networks [13] as the state of the art. However, it still faces many challenges such as heavy occlusion and a partially visible human body. RN [14] is one of the backbones with the best results in feature extraction on the ImageNet dataset and is used in many CNNs to detect, segment, and recognize objects, and to estimate pose (as presented in Figure 1 [15]). In this paper, we experiment to compare the estimation of 2D human pose based on studies using CNNs to estimate the 2D human pose according to the regression methods. We use different versions of RN for 2D-HPE.

The training model is based on RN-10, RN-18, RN-50, RN-101, and RN-152. The results of 2D human pose prediction are evaluated on the benchmark HU-3.6M-D, which is a widely used and challenging dataset in which the body parts of the human are often obscured. To get the 2D human pose annotation data of the HU-3.6M-D for the 2D-HPE, we perform an inverse transformation from the 3D pose annotation of the human in the Real-World Coordinate System (R-WCS) of the MoCap system to the 2D pose annotation of the human in image coordinates, based on the set of intrinsic parameters provided for calibration of the image data. The results are presented in the following parts of the paper.

In this paper, we have some contributions as follows:

• We have fine-tuned different versions of the RN with input data of size (224 × 224) to estimate the 2D human pose in the RGB image.

• We have fine-tuned the estimation model on the HU-3.6M-D, with the 2D pose ground-truth determined based on the 3D pose annotation data and the intrinsic parameters of the camera.

• We evaluate the estimated results based on the absolute coordinate differences between the original data and the estimated data. From there, we choose the best RN version with input data of 224 × 224 for 2D-HPE on the RGB image, which will also give good results in 3D-HPE.

The paper is organized as follows. Section 1 introduces several backbones for detecting and estimating people on images. Section 2 presents the related studies on 2D-HPE methods. Section 3 presents the main idea and versions of RN. Section 4 shows and discusses the experimental results of 2D human pose estimation, and Section 5 concludes the paper and future works.
2. Related Works

RN [14] is a backbone applied in many CNNs for feature extraction and object prediction in the first step, such as Fast R-CNN [16], Faster R-CNN [17], Mask R-CNN [18], etc. Figure 1 shows the RN as the backbone in the Mask R-CNN network architecture. RN [14] is more efficient than other backbones like AlexNet [19] and VGG [20], [21].

2D-HPE from RGB image data using CNNs can be done by two methods [10]: regression methods and body-part detection methods.

The regression methods use a CNN model to learn joint locations, from the input ground-truth joints on the images, to body joints or parameters of human body models/human skeletons to predict the keypoints on the human. Toshev et al. [22] proposed a Deep Neural Network (DNN) based on the cascade technique for regressing the locations of body joints. The proposed CNN includes seven layers; the input image of the CNN is resized to 220 × 220 pixels. The cascade-of-pose-regressors technique is applied to train the multi-layer prediction model. The first stage of the cascade starts with an initial pose predicted over the entire input image. In the next stages, DNN regressors are trained to predict the displacement of the joint locations with respect to the correct locations from the previous stage. Thus, the currently predicted pose is refined at each subsequent stage. Liang et al. [23] proposed a strategy of compositional pose regression based on the RN-50 [14]. The bones are parameterized in a bone-based representation that contains the human skeleton information and skeleton structure; a joint-based representation is not used. The loss function is calculated based on each part of the human body, and the joints are defined based on a constant origin point J0 in the image coordinate system. Each bone has a directed vector pointing from it to its parent. Luvizon et al. [24] proposed a regression method that used two Soft-argmax blocks (Block-A and Block-B) for 2D human pose estimation from images: Block-A provides refined features, and Block-B provides skeleton-part and active context maps. The two blocks are used to build one prediction block. Block-A uses a residual separable convolution; the input feature maps are transformed into part-based detection maps and context maps by Block-B.

As for the body-part detection methods, a body-part detector is trained to predict the locations of human joints. Newell et al. [13] proposed the stacked hourglass architecture to train a model that predicts the positions of body joints on a heatmap, in which the 2D annotation is used to generate the heatmap by the 2D Gaussian heatmap method. The stacked hourglass repeats bottom-up and top-down processing with intermediate supervision across its eight hourglasses. This CNN uses convolutional and max-pooling layers down to a very low resolution, followed by a top-down sequence of upsampling (nearest-neighbor upsampling of the lower resolution) and a combination of features across scales. The result of 2D-HPE on the MPII dataset based on the PCKh@0.5 measure is 90.9%; 99.0% and 97.0% (elbow and wrist) are the results on the FLIC dataset based on the PCK@0.2 measure.

Cao et al. [25] proposed a two-branch CNN model in which the body-part detection is predicted from heatmaps by using the 2D keypoint annotation to generate the ground-truth confidence maps. The confidence maps are predicted by the first branch, and the part affinity fields are predicted by the second branch. The part affinity fields are a novel feature representation of both location and orientation information across the limb's active region.

Most of the above studies were evaluated on the COCO [26] and MPII [27] datasets using the Percentage of Correct Keypoints (PCK) measure. This measure is usually based on the estimated joint length relative to a reference joint length, without taking into account the absolute estimates of the 2D keypoints (estimated absolute coordinates versus ground-truth coordinates).

3. 2D-HPE Based on The RN and Its Variations

2D-HPE is an intermediate result for estimating the 3D human posture in both the 2D-to-3D lifting methods and the model-based methods [10]. Therefore, the 2D-HPE results have a great influence on the 3D-HPE results. The RN is applied in many studies on human pose estimation and gives good results [28]-[29].




Figure 1: The human instance segmentation model on the image based on the Mask R-CNN architecture. Mask R-CNN is generated based on the combination of Faster R-CNN [17] and FCN.


Figure 2: Illustrating the architecture of the RN for 2D-HPE.
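Consistent with Figure 2, a 2D-HPE regressor can be built by replacing the classification head of an RN backbone with a 2 × 17 regression layer. The following is a minimal PyTorch sketch of this idea, not the authors' released code: torchvision's ResNet-18 stands in for the paper's RN variants, since RN-10 is not a stock torchvision model.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ResNetPose2D(nn.Module):
    """ResNet backbone with a regression head for J=17 2D keypoints."""

    def __init__(self, num_joints: int = 17):
        super().__init__()
        backbone = models.resnet18()  # stand-in for any RN variant
        # Replace the 1000-way classification layer with a 2*J regressor.
        backbone.fc = nn.Linear(backbone.fc.in_features, num_joints * 2)
        self.backbone = backbone
        self.num_joints = num_joints

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, 224, 224) -> (batch, 17, 2) pixel coordinates
        out = self.backbone(x)
        return out.view(-1, self.num_joints, 2)

model = ResNetPose2D()
dummy = torch.randn(2, 3, 224, 224)
print(model(dummy).shape)  # torch.Size([2, 17, 2])
```

The 224 × 224 input size matches the resizing described later in Section 4.2.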

That is the motivation for us to carry out this study to select a model with good 2D pose estimation results. We compared the results from different versions of the RN (RN-10, RN-18, RN-34, RN-50, RN-101, RN-152) [14] to select the best one. The Residual Network (ResNet/RN) was introduced in 2015 and took 1st place in the 2015 ILSVRC challenge with an error rate of only 3.57%. Currently, there are many variations of the RN architecture with different numbers of layers. The name RN is followed by a number indicating the RN architecture with a certain number of layers: RN-10 (10 Conv layers), RN-18 (18 Conv layers), RN-34 (34 Conv layers), RN-50 (50 Conv layers), RN-101 (101 Conv layers), RN-152 (152 Conv layers), as shown in Fig. 3.

RN is a DNN designed to work with hundreds or thousands of convolutional layers. When building a CNN with many convolutional layers, the Vanishing Gradient phenomenon occurs, leading to bad model training results. The Vanishing Gradient phenomenon is presented as follows. The training process in a DNN often uses the Backpropagation Algorithm [30]. The main idea of this algorithm is that the output of the current layer is the input of the next layer, and the gradient of the cost function is computed for each parameter (weight) of the network. Gradient Descent is then used to update those parameters.




The above process is repeated until the parameters of the network converge. Normally we would have a hyper-parameter (the number of epochs, i.e., the number of times the training set is traversed and the weights updated) that defines the number of iterations of this process. If the number of loops is too small, the network may not give good results; conversely, the training time will be longer if the number of loops is too large. However, in practice, gradients often have smaller values at the lower layers. As a result, the updates performed by Gradient Descent do not change the weights of those layers much, so they do not converge and the network does not work well. This phenomenon is called "Vanishing Gradients". RN proposed to use an "identity shortcut connection" to traverse one or more layers, as illustrated in Fig. 4.

Figure 3: Illustrating the RN-152 architecture.

Figure 4: A Residual Block across two layers of RN.

RN is like other CNNs: it includes convolution, pooling, activation, and fully-connected layers. In RN, there appears a curved arrow starting at the beginning and ending at the end of the residual block, as in Fig. 4. In other words, the block adds the input value x to the output of its layers, which counteracts the zero derivatives, since x is still added. With H(x) denoting the desired output and F(x) the mapping learned by the stacked layers, the block is trained so that F(x) + x is equal to or approximates H(x).

When the input dimensions of the block are the same as its output dimensions, RN uses the identity block; otherwise, it uses the convolutional block, as presented in Fig. 5.

Figure 5: Illustrating the convolutional block of RN.

In this paper, RN is the backbone for 2D-HPE and feature extraction. Recently, RN version 2 (v2) [14] was introduced as an improvement of RN version 1 (v1) in classification performance. The residual block [31] of RN v2 has two changes: a stack of 1×1-3×3-1×1 convolutions with the steps BN, ReLU, Conv2D is used, and the Batch Normalization and ReLU activation come before the 2D convolution. Figure 6 shows the difference between RN v1 and RN v2.

Figure 6: A comparison of residual blocks between RN v1 and RN v2 [31].
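To make the shortcut-connection idea concrete, here is a minimal PyTorch sketch of a v1-style residual block with an identity shortcut (our own simplified illustration of [14]; the convolutional-block variant would add a 1×1 convolution on the shortcut when the input and output shapes differ):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x) (identity shortcut)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x                      # the "curved arrow" in Fig. 4
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + shortcut              # adding x keeps gradients flowing
        return self.relu(out)

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```

Because the shortcut adds x directly to the stacked-layer output, the gradient of the block always contains an identity term, which is what counteracts the Vanishing Gradient effect described above.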

4. Experimental Results

4.1. Dataset

To fine-tune, generate, and evaluate the estimation models, we use the benchmark HU-3.6M-D [32]. HU-3.6M-D is an indoor dataset for the evaluation of 3D-HPE from a single view or multiple views of the cameras (the data is collected in a lab environment from 4 different perspectives). This dataset is captured from 11 subjects/people (6 males and 5 females); the people perform six types of action (upper body directions movement, full body upright variations, walking variations, variations while seated on a chair, sitting on the floor, and various movements), which include 16 daily activities (directions, discussion, greeting, posing, purchases, taking photo, waiting, walking, walking dog, walking pair, eating, phone talk, sitting, smoking, sitting down, miscellaneous).




Figure 7: An illustration of human pose in HU-3.6M-D.

The frames are captured from TOF (Time-of-Flight) cameras; the data frame rate of the cameras is from 25 to 50 Hz. This dataset contains about 3.6 million images (1,464,216 frames for training - 5 people (2 female and 3 male); 646,180 frames for validation - 2 people (1 female and 1 male); 1,467,684 frames for testing - 4 people (2 female and 2 male)) and 3.6 million 3D human pose annotations captured by the marker-based MoCap system. The 3D human pose annotation of HU-3.6M-D consists of 17 keypoints arranged in the order shown in Fig. 7.

The 3D human pose annotations of HU-3.6M-D are annotated based on the MoCap system. The coordinate system of this data is the R-WCS. To evaluate the estimation results, we convert this data to the Camera Coordinate System (CCS). We base this on the parameter set of the cameras and the formula of Nicolas Burrus [33] for converting data from 2D to 3D, as in Eq. 1.

P3Dc.x = (xd − cx) · depth(xd, yd) / fx
P3Dc.y = (yd − cy) · depth(xd, yd) / fy        (1)
P3Dc.z = depth(xd, yd)

where fx, fy, cx, and cy are the intrinsics of the depth camera, and P3Dc is the coordinate of the keypoint in the CCS.

Before evaluating the results of the 2D posture estimation, we re-projected the 3D human pose annotation from the R-WCS to the CCS using Eq. 2.

P3Dc = (P3Dw − T) · R⁻¹        (2)

where R and T are the rotation and translation parameters that transform from the R-WCS to the CCS, and P3Dw is the coordinate of the keypoint in the R-WCS. We then projected to the 2D human pose annotation using Eq. 3.

P2D.x = (P3Dc.x · fx) / P3Dc.z + cx
P2D.y = (P3Dc.y · fy) / P3Dc.z + cy        (3)

where P2D is the coordinate of the keypoint in the image.

The source code and the 2D annotation data of the HU-3.6M-D are available at link 1.

The authors have divided the HU-3.6M-D into 3 protocols to train and test the estimation models. Protocol #1 includes Subject #1, Subject #5, Subject #6, and Subject #7 for the training model, and Subject #9 and Subject #11 for the testing model. Protocol #2 is divided similarly to Protocol #1; however, the predictions are further post-processed by a rigid transformation before comparing to the ground-truth. Protocol #3 includes Subject #1, Subject #5, Subject #6, Subject #7, and Subject #9 for the training model, and Subject #11 for the testing model. This dataset is saved in path 2.
1 https://drive.google.com/drive/folders/1s3VmcZL8M2EK2M_Ese1-EWBVEZ5brM7Q?usp=sharing
2 http://vision.imar.ro/human3.6m/
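As a worked illustration of the annotation transformation in Eqs. 2 and 3, the following is a minimal NumPy sketch (our own, not the authors' released code) of projecting a 3D keypoint from the R-WCS to 2D image coordinates. The names R, T, fx, fy, cx, cy follow the notation above; all numeric values are hypothetical.

```python
import numpy as np

def world_to_pixel(p3d_w, R, T, fx, fy, cx, cy):
    """Project a 3D keypoint in world coordinates to 2D pixel coordinates.

    Eq. 2: transform world -> camera coordinates with extrinsics (R, T).
    Eq. 3: perspective projection with intrinsics (fx, fy, cx, cy).
    """
    # Eq. 2: apply R^-1 to (P3Dw - T); for an orthonormal rotation
    # matrix, R.T would give the same result as np.linalg.inv(R).
    p3d_c = np.linalg.inv(R) @ (p3d_w - T)

    # Eq. 3: perspective division, then shift by the principal point.
    u = p3d_c[0] * fx / p3d_c[2] + cx
    v = p3d_c[1] * fy / p3d_c[2] + cy
    return np.array([u, v])

# Hypothetical camera parameters, for illustration only.
R = np.eye(3)                      # rotation (world -> camera)
T = np.array([0.0, 0.0, -4000.0])  # translation, in millimeters
fx, fy, cx, cy = 1145.0, 1144.0, 512.0, 515.0
joint_world = np.array([250.0, -300.0, 1000.0])
print(world_to_pixel(joint_world, R, T, fx, fy, cx, cy))
```

Applying this to every annotated joint yields the 2D ground-truth used to fine-tune and evaluate the RN models below.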




4.2. Implementation

The input data of our network includes color/RGB image data and the 2D human pose annotation. All images are resized to (224 × 224) before being fed to the network.

In this paper, the loss function for training the estimation model includes two parts: L1 and L2. We used the L1 loss function and the Adam optimizer for the training process. First, we initialized the loss function L1 for the 2D coordinates predicted from RN. Then, we computed the loss function L2 from the predicted 2D data. The loss function L of the whole training process is calculated as Eq. 4.

L = α · L1 + β · L2        (4)

We set α and β to 0.1 to bring the 2D errors (in pixels) into a similar range. The mean error was used to calculate the loss functions. We trained each network for 10 epochs, with a batch size of 32, the Adam optimizer with a learning rate of 0.001, and 4 workers.
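A minimal PyTorch sketch of this weighted two-part objective (our own illustration: the exact forms of L1 and L2 are not fully specified in the text, so here L1 is taken as the mean absolute error and L2 as the mean squared error, both over the regressed 2D coordinates):

```python
import torch
import torch.nn as nn

alpha, beta = 0.1, 0.1          # both weights set to 0.1, as in the text
l1_criterion = nn.L1Loss()      # L1 term on the regressed 2D coordinates
l2_criterion = nn.MSELoss()     # L2 term (assumed mean squared error)

def total_loss(pred_2d: torch.Tensor, gt_2d: torch.Tensor) -> torch.Tensor:
    # Eq. 4: L = alpha * L1 + beta * L2, both computed between the
    # predicted (batch, 17, 2) keypoints and the 2D ground truth.
    return alpha * l1_criterion(pred_2d, gt_2d) + beta * l2_criterion(pred_2d, gt_2d)
```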
In this paper, we used a PC with a GTX 970 GPU (4 GB) for fine-tuning, training, and testing the RN and its variations. The source code of the fine-tuning, training, testing, and development process was developed in the Python language (version ≥3.6) with the support of the OpenCV-Python, PyTorch/Torch (version ≥1.1), and CUDA/cuDNN 11.2 libraries. In addition, the support of some other libraries is required, such as NumPy, SciPy, Pillow, Cython, Matplotlib, Scikit-image, TensorFlow ≥1.3.0, Keras ≥2.0.8, H5py, Imgaug, and IPython. The source code for fine-tuning, training, and testing is available at link 3.
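Putting the pieces together, a minimal training-loop sketch with the settings reported above (10 epochs, batch size 32, Adam with learning rate 0.001, 4 dataloader workers); `model` stands for any of the RN regressors, `train_set` for a hypothetical dataset yielding (image, 17×2 keypoints) pairs, and `total_loss` reuses the Eq. 4 sketch above:

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs: int = 10):
    # Settings from the text: batch size 32, Adam with lr=0.001,
    # 4 dataloader workers, 10 training epochs.
    loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    for epoch in range(epochs):
        for images, keypoints in loader:
            optimizer.zero_grad()
            loss = total_loss(model(images), keypoints)  # Eq. 4, see above
            loss.backward()
            optimizer.step()
```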

4.3. Evaluation Measure

To evaluate 2D-HPE, we evaluate in two phases. The first is to evaluate the 2D human pose estimation results based on Eq. 5. It is the average distance between the 2D keypoints of the 2D ground-truth and the estimated 2D keypoints when using the trained model based on RN; the distance is calculated as the L2 error value on the test set, in pixels.

Erravg = (1/N) Σ_{n=1}^{N} (1/J) Σ_{i=1}^{J} L2(p_i, p̂_i)        (5)

where N and J are the number of frames and the number of joints (J = 17), respectively; p̂_i and p_i are the predicted and ground-truth coordinates of the i-th joint; and L2 is the Euclidean distance between two points.
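Eq. 5 translates directly into a few lines of NumPy. A minimal sketch (our own illustration), assuming the predictions and ground truth are stored as (N, J, 2) arrays:

```python
import numpy as np

def err_avg(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average 2D keypoint error of Eq. 5, in pixels.

    pred, gt: arrays of shape (N, J, 2) holding the predicted and
    ground-truth image coordinates of J=17 joints over N frames.
    """
    # L2 (Euclidean) distance per joint, then mean over joints and frames.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Tiny hypothetical check: a constant (3, 4) offset gives a 5-pixel error.
gt = np.zeros((2, 17, 2))
pred = gt + np.array([3.0, 4.0])
assert np.isclose(err_avg(pred, gt), 5.0)
```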
4.4. Results and discussions

The pre-trained models of RN and its variants are available at link 4. In this paper, we only evaluate the 2D-HPE on Protocol #1 and Protocol #3 of the HU-3.6M-D. The average error (Erravg) between the 2D keypoint annotation and the estimated 2D keypoint of Protocol #1 of the HU-3.6M-D is shown in Tab. 2.

Table 2: The average error (Erravg) between the 2D keypoint annotation and the estimated 2D keypoint of Protocol #1 of the HU-3.6M-D.

CNN                               RN-10   RN-18   RN-50    RN-101   RN-152
Average Error (Erravg) (pixels)   34.96   38.58   669.11   628.95   652.34

The average error (Erravg) on the validation set following each epoch of Protocol #1 of the HU-3.6M-D is shown in Fig. 8.

Figure 8: The average error between the 2D keypoint annotation and the estimated 2D keypoint of Protocol #1 on the validation set.

The average error (Erravg) on the validation set following each epoch of Protocol #3 of the HU-3.6M-D is shown in Fig. 9.

Figure 9: The average error between the 2D keypoint annotation and the estimated 2D keypoint of Protocol #3 on the validation set.

In Table 2, the average error of the RN-10 is 34.96 pixels, which is the best result on Protocol #1. The average error (Erravg) between the 2D keypoint annotation and the estimated 2D keypoint of Protocol #3 of the HU-3.6M-D is shown in Tab. 1.

In Table 1, the average error of the RN-10 is 28.48 pixels, which is the best result on Protocol #3. Figure 10 illustrates a 2D-HPE result on the image: the blue skeleton is the ground-truth human pose and the red skeleton is the estimated human pose. When we train RN-10 with 50 epochs, the average error (Erravg) on the test sets of Protocol #1 and Protocol #3 is 15.8 pixels and 12.4 pixels, respectively. Thus, increasing the number of training epochs reduces the estimation error.

In this paper, the RN-10 has better results than the RN-18, RN-50, RN-101, and RN-152 networks when training for 10 epochs.

3 https://drive.google.com/drive/folders/1-Hu2842xWDtZWBo762iT_viBYcuaAR7V?usp=sharing
4 https://drive.google.com/drive/folders/1pXkTmHAjFDNK3VaFdcGH614LF8pG4QXM?usp=sharing




Table 1: The average error (Erravg) between the 2D keypoint annotation and the estimated 2D keypoint of Protocol #3 of the HU-3.6M-D.

CNN                               RN-10   RN-18   RN-50    RN-101   RN-152
Average Error (Erravg) (pixels)   28.48   29.35   578.99   602.44   593.48

Figure 10: Illustrating a 2D-HPE result on the image of Protocol #1 on Subject #9, Subject #11.

Figure 11: Illustrating a 2D-HPE result on the image of Protocol #1 on Subject #9 of RN-10.




The RN-10 network is a smaller CNN than the other networks, which suggests that a smaller number of convolutional layers makes the network converge faster. This is also consistent with the explanation that smaller networks learn more efficiently than large CNNs [34].

Figure 11 shows the result sequences of 2D-HPE of RN-10 on Protocol #1 of Subject #9. The estimated 2D keypoints are blue-green nodes; the links between the estimated 2D keypoints are the red lines.

5. Conclusions and Future Works

In this paper, we have performed a comparative study of 2D-HPE based on versions of RN (RN-10, RN-18, RN-50, RN-101, RN-152) on the HU-3.6M-D with two evaluation protocols (Protocol #1, Protocol #3). We have transformed the 3D human pose annotation data to 2D human pose annotations. The average error of the RN-10 is 34.96 pixels and 28.48 pixels, which is the best result on Protocol #1 and Protocol #3, respectively. The results are evaluated and shown in detail and visually on the images. Therefore, RN-10 is a good CNN for estimating the 2D human pose on images, and this result can be used to estimate the 3D human pose. In the future, we will use the human pose estimation results of RN-10 for 3D-HPE to compare with the studies of Zheng et al. [35] and Li et al. [9], which currently have the best results on 3D-HPE.

Conflict of Interest: The paper is our own research, not related to any organization or individual. It is part of a series of studies on 2D and 3D human pose estimation.

Acknowledgement: This research is funded by Tan Trao University in Tuyen Quang, Viet Nam.

References

[1] N. S. Willett, H. V. Shin, Z. Jin, W. Li, A. Finkelstein, "Pose2Pose: Pose Selection and Transfer for 2D Character Animation", International Conference on Intelligent User Interfaces (IUI), pp. 88–99, 2020, doi:10.1145/3377325.3377505.

[2] H. Zhang, C. Sciutto, M. Agrawala, K. Fatahalian, "Vid2Player: Controllable Video Sprites That Behave and Appear Like Professional Tennis Players", ACM Transactions on Graphics, vol. 40, no. 3, pp. 1–16, 2021, doi:10.1145/3448978.

[3] W. Chen, Z. Jiang, H. Guo, X. Ni, "Fall Detection Based on Key Points of Human-Skeleton Using OpenPose", Symmetry, 2020.

[4] N. T. Thanh, L. V. Hung, P. T. Cong, [publication details illegible in source].

[5] X. Zhou, Q. Huang, X. Sun, X. Xue, Y. Wei, "Towards 3d human pose estimation in the wild: A weakly-supervised approach", The IEEE International Conference on Computer Vision (ICCV), 2017.

[6] G. Chandrasekaran, S. Periyasamy, K. Panjappagounder Rajamanickam, "Minimization of test time in system on chip using artificial intelligence-based test scheduling techniques", Neural Computing and Applications, vol. 32, no. 9, pp. 5303–5312, 2020, doi:10.1007/s00521-019-04039-6.

[7] G. Chandrasekaran, P. R. Karthikeyan, N. S. Kumar, V. Kumarasamy, "Test scheduling of System-on-Chip using Dragonfly and Ant Lion optimization algorithms", Journal of Intelligent and Fuzzy Systems, vol. 40, no. 3, pp. 4905–4917, 2021, doi:10.3233/JIFS-201691.

[8] J. Martinez, R. Hossain, J. Romero, J. J. Little, "A Simple Yet Effective Baseline for 3d Human Pose Estimation", Proceedings of the IEEE International Conference on Computer Vision, pp. 2659–2668, 2017.

[9] S. Li, L. Ke, K. Pratama, Y.-W. Tai, C.-K. Tang, K.-T. Cheng, "Cascaded deep monocular 3d human pose estimation with evolutionary training data", The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[10] Q. Dang, J. Yin, B. Wang, W. Zheng, "Deep learning based 2D human pose estimation: A survey", Tsinghua Science and Technology, vol. 24, no. 6, pp. 663–676, 2019, doi:10.26599/TST.2018.9010100.

[11] Z. Luo, Z. Wang, Y. Huang, L. Wang, T. Tan, E. Zhou, "Rethinking the Heatmap Regression for Bottom-up Human Pose Estimation", CVPR, pp. 13259–13268, 2021, doi:10.1109/cvpr46437.2021.01306.

[12] A. Bulat, G. Tzimiropoulos, "Human pose estimation via convolutional part heatmap regression", European Conference on Computer Vision, vol. 9911 LNCS, pp. 717–732, 2016, doi:10.1007/978-3-319-46478-7_44.

[13] A. Newell, K. Yang, J. Deng, "Stacked Hourglass Networks for Human Pose Estimation", ECCV, 2016.

[14] K. He, X. Zhang, S. Ren, J. Sun, "Deep residual learning for image recognition", IEEE Conference on CVPR, pp. 770–778, 2016, doi:10.1109/CVPR.2016.90.

[15] R. Zhang, L. Du, Q. Xiao, J. Liu, "Comparison of Backbones for Semantic Segmentation Network", Journal of Physics: Conference Series, vol. 1544, 2020, doi:10.1088/1742-6596/1544/1/012196.

[16] R. Girshick, "Fast R-CNN", Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015, doi:10.1109/ICCV.2015.169.

[17] S. Ren, K. He, R. Girshick, J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017, doi:10.1109/TPAMI.2016.2577031.

[18] K. He, G. Gkioxari, P. Dollar, R. Girshick, "Mask R-CNN", ICCV, 2017.

[19] A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet classification with deep convolutional neural networks", in F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger, eds., Advances in Neural Information Processing Systems, vol. 25, Curran Associates, Inc., 2012.

[20] M. Lin, Q. Chen, S. Yan, "Network in network", 2nd International Conference on Learning Representations (ICLR 2014), Conference Track Proceedings, pp. 1–10, 2014.

[21] K. Simonyan, A. Zisserman, "Very deep convolutional networks for large-scale image recognition", in Y. Bengio, Y. LeCun, eds., 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[22] A. Toshev, C. Szegedy, "DeepPose: Human Pose Estimation via Deep Neural Networks", IEEE Conference on CVPR, 2014.

[23] S. Liang, X. Sun, Y. Wei, "Compositional Human Pose Regression", ICCV, vol. 176-177, pp. 1–8, 2017, doi:10.1016/j.cviu.2018.10.006.

[24] D. C. Luvizon, H. Tabia, D. Picard, "Human pose regression by combining indirect part detection and contextual information", Computers and Graphics (Pergamon), vol. 85, pp. 15–22, 2019, doi:10.1016/j.cag.2019.09.002.

[25] Z. Cao, T. Simon, S. E. Wei, Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields", IEEE Conference on CVPR, pp. 1302–1310, 2017, doi:10.1109/CVPR.2017.143.

[26] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, "Microsoft COCO: Common objects in context", Lecture Notes in Computer Science, vol. 8693 LNCS, pp. 740–755, 2014, doi:10.1007/978-3-319-10602-1_48.



[27] M. Andriluka, L. Pishchulin, P. Gehler, B. Schiele, "2d human pose estimation: New benchmark and state of the art analysis", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[28] X. Xiao, W. Wan, "Human pose estimation via improved ResNet50", 4th International Conference on Smart and Sustainable City (ICSSC 2017), vol. 148, pp. 148–162, 2017.

[29] Y. Wang, T. Wang, "Cycle Fusion Network for Multi-Person Pose Estimation", Journal of Physics: Conference Series, vol. 1550, no. 3, 2020.

[30] N. Benvenuto, F. Piazza, "On the Complex Backpropagation Algorithm", IEEE Transactions on Signal Processing, vol. 40, no. 4, pp. 967–969, 1992, doi:10.1109/78.127967.

[31] N. V. Hieu, N. L. H. Hien, "Recognition of plant species using deep convolutional feature extraction", International Journal on Emerging Technologies, vol. 11, no. 3, pp. 904–910, 2020.

[32] C. Ionescu, D. Papava, V. Olaru, C. Sminchisescu, "Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments", TPAMI, vol. 36, no. 7, pp. 1325–1339, 2014.

[33] N. Burrus, "Kinect calibration", http://nicolas.burrus.name/index.php/Research/KinectCalibration.

[34] X. Zhang, X. Zhou, M. Lin, J. Sun, "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices", CVPR, pp. 6848–6856, 2018.

[35] C. Zheng, S. Zhu, M. Mendieta, T. Yang, C. Chen, Z. Ding, "3d human pose estimation with spatial and temporal transformers", Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021.

Copyright: This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Van-Hung Le received an M.Sc. degree at the Faculty of Information Technology, Hanoi National University of Education (2013). He received a PhD degree at the International Research Institute MICA HUST-CNRS/UMI-2954-INP Grenoble (2018). Currently, he is a lecturer at Tan Trao University. His research interests include computer vision, RANSAC and RANSAC variations, 3-D object detection and recognition, machine learning, and deep learning.

Trung-Minh Bui received a Bachelor degree at Thai Nguyen University of Information and Communication Technology (ICTU) (2010). He received an M.Sc. degree at Thai Nguyen University of Information and Communication Technology (ICTU) (2014). Currently, he is a lecturer at Tan Trao University. His research interests include computer science, machine learning, and deep learning.

Hai-Yen Tran received a Bachelor degree at the Faculty of Information Technology, National Economics University (2009). She received an M.Sc. degree at the Faculty of Information Technology, Hanoi National University of Education (2013). Currently, she is a lecturer at the Vietnam Academy of Dance. Her research interests include computer science and deep learning.

Thi-Loan Pham received a Bachelor degree at the Faculty of Information Technology, Hanoi Pedagogical University 2 (2007). She received an M.Sc. degree at the University of Engineering and Technology (2012). Currently, she is a lecturer at Hai Duong College. Her research interests include computer science and deep learning.

