Human Pose Estimation Using MediaPipe Pose and
Optimization Method Based on a Humanoid Model
Jong-Wook Kim , Jin-Young Choi, Eun-Ju Ha and Jae-Ho Choi *
Department of Electronics Engineering, Seunghak Campus, Dong-A University, Busan 49315, Republic of Korea
* Correspondence: [email protected]
Abstract: Seniors who live alone at home are at risk of falling and injuring themselves and, thus,
may need a mobile robot that monitors and recognizes their poses automatically. Even though deep
learning methods are actively evolving in this area, they have limitations in estimating poses that are
absent or rare in training datasets. For a lightweight approach, an off-the-shelf 2D pose estimation
method, a more sophisticated humanoid model, and a fast optimization method are combined to
estimate joint angles for 3D pose estimation. As a novel idea, the depth ambiguity problem of 3D
pose estimation is solved by adding to the loss function the deviation of the center of mass from the
center of the supporting feet, together with penalty functions on the appropriate joint angle rotation ranges. To verify
the proposed pose estimation method, six daily poses were estimated with a mean joint coordinate
difference of 0.097 m and an average angle difference per joint of 10.017 degrees. In addition, to
confirm practicality, videos of exercise activities and a scene of a person falling were filmed, and the
joint angle trajectories were produced as the 3D estimation results. The optimized execution time
per frame was measured at 0.033 s on a single-board computer (SBC) without GPU, showing the
feasibility of the proposed method as a real-time system.
Keywords: human pose estimation; humanoid robot; global optimization method; MediaPipe
Pose; uDEAS
the authors decided that the most realistic and simplest way would be to photograph a
subject with a 2D camera and estimate each joint angle using a fast optimization algorithm
based on a 3D humanoid model. This technology belongs to the field of 3D human pose
estimation with a monocular image or video.
Human pose estimation technology is being actively researched around the world
in the areas of sports, surveillance, work monitoring, home elderly care, home training,
entertainment, gesture control, and even metaverse avatars. In general, human pose esti-
mation is classified into 2D and 3D coordinate estimation methods, single-person- and
multiple-person-based methods according to the number of target subjects, monocular
image- and multi-view image-based methods according to the number of shooting cameras,
and single-image- and video-based methods according to the input type [7–11].
Specifically, according to the structure of deep learning process, human pose estimation
is classified into single-stage methods and two-stage methods. The single-stage methods
that directly map input images to 3D body joint coordinates can be categorized into
two classes: detection-based methods [12,13] and regression-based methods [14,15]. The
detection-based methods predict a likelihood heatmap for each joint, whose location is
determined by taking the maximum likelihood of the heatmap, while the regression-based
methods directly estimate the location of joints relative to the root joint location [14] or the
angle of joints by introducing a kinematic model consisting of several joints and bones [15].
Because 2D pose estimation has a greater number of in-the-wild datasets with ground-
truth joint coordinates than 3D pose estimation, two-stage methods of leveraging 2D pose
estimation findings for 3D human pose estimation, also known as lifting from 2D to 3D, are
being developed extensively [16]. The relationships between joints have been exploited
by long short-term memory (LSTM) [17], and generative adversarial networks (GANs) are
often used to produce a more realistic 3D human pose [18].
In spite of continuous technological advances, the deep learning methods of 3D pose
estimation from 2D images should solve challenging problems, including a lack of in-the-
wild datasets, a huge demand for various posture data, depth ambiguities, and a large
searching state space for each joint [9]. Furthermore, a high-performance PC equipped
with many GPUs is essential for executing deep learning packages.
Firstly, collecting a large in-the-wild dataset with 3D annotation is very labor-intensive,
and thus building the popular datasets of HumanEva [19] and Human3.6M [20] requires
expensive motion capture systems and many subjects and experiments. Secondly, human
pose has an infinite number of variants according to camera translation, body orientation,
differences in height and body part ratio, etc. Thirdly, the depth ambiguity problem
arises because different 3D poses can be mapped to a single 2D pose, which is known
to be mitigated using temporal information from a series of images or multi-viewpoint
images [21]. Lastly, the requirement for an at least 17-dimensional joint space is also a high-
order problem for conventional optimization methods, and thus optimization results are
time-consuming to obtain and unacceptably imprecise.
In this paper, to run a human pose estimation package on an SBC installed in a
mobile robot, a new type of two-stage pose estimation method is proposed. The first stage
of 2D pose estimation is performed with MediaPipe Pose [22], and the second stage of
estimating joint angles is carried out with a fast optimization method, uDEAS [23] based
on an elaborate humanoid model. We propose a 3D full-body humanoid model whose
reference coordinate frame is located at the center of the pelvis, i.e., root joint, and three DoF
(Degree of Freedom) lumbar joints are newly added to the center. The three lumbar joints
of twist, flexion, and lateral flexion can make poses in which only the upper body rotates or
bends, and thus they are indispensable joints necessary to create various natural poses, such
as yoga, sitting, and lying, amongst others. However, some recent deep learning methods
lack these core joints when modeling the human body [5,6]. In addition, the joint rotation
polarity rules for all the humanoid joints are designed to be consistent with those of the
Vicon motion capture system [24], which bridges numerous physiological research results
on various activities of the human body using Vicon data [25]. An innovative method for
resolving the inverse kinematics of the humanoid is to use uDEAS to tune 19 unknown
pose-relevant variables for each frame of real-time pose estimation with camera-based or
video-based images. This allows the humanoid joint angles to fit the 2D humanoid model
that is reprojected from the 3D model to the MPP skeleton as closely as possible.
The proposed approach avoids the first problem, as it does not require a significant amount of
human pose data, and a full set of optimization variables is constructed for resolving the
second pose variation problem. The third problem of depth ambiguity can be addressed
by adding the deviation of the center of mass from the center of the supporting feet to the
loss function, as well as appropriate penalty functions concerning the allowed range of joint angle rotation.
The fourth problem can be overcome by employing a fast optimization method, such as uDEAS.
For the validation of the proposed approach, several ADL poses were attained by
simulation and experiment. We generated simulations of human poses using a humanoid
model and given joint angles as ground-truth data, and we allowed uDEAS to estimate the
true joint angles by taking simulated poses as the input. In order to check for practicality,
gymnastics motion and sudden fall motion were filmed with a camera. The 3D pose
estimation results and the obtained joint trajectories were acceptable for application to
mobile robots that monitor poses. Unfortunately, most state-of-the-art deep learning
methods require CUDA-relevant libraries and GPU hardware, and we could not apply
them to our small mobile robot.
The contributions of this paper are presented below:
• In order to simulate and estimate a human-like pose, a full-body humanoid robot
model with lumbar joints was constructed including effects of camera view angle
and distance.
• Instead of solving the inverse kinematics of a humanoid for a given 2D skeletal model,
the heuristic optimization method uDEAS directly adjusts the camera-relative body
angles and intra-body joint angles to match the 2D projected humanoid model to the
2D skeletal model.
• The depth ambiguity problem can be solved by adding to the loss function the deviation of the
center of mass from the center of the supporting foot (feet) and appropriate penalty
functions for the ranges of natural joint angle rotations.
• The proposed 3D human body pose estimation system showed an average execution
time of 0.033 s per frame using an inexpensive SBC without a GPU.
• We find that rare poses resulting from falling activity were well estimated in the
present system. This may be difficult with deep learning methods due to the lack of
training data.
This paper is organized into five sections. Section 2 briefly describes the methods
that comprise the proposed pose estimation system. Section 3 explains the structure of
the proposed system, and Section 4 describes the experimental results when applying our
system to several representative poses. Section 5 concludes the present work and discusses
future work.
The subscripts hd, sh, el, tr, hp, kn, and an indicate joint names of the head, shoulder, elbow, torso, hip, knee, and ankle,
respectively, and the superscripts l and r denote the left and right parts, respectively.
[Figure: kinematic diagram of the full-body humanoid model, showing the joint coordinate frames (xi, zi), the link lengths (lhd, lnk, lua, lla, lpv, lfe, ltb, lca), and the joint angle definitions for the head, shoulders, elbows, torso, hips, knees, and ankles.]
Figure 4. Relationship between the body coordinate frame, xh yh zh, and the camera-based coordinate
frame, xc yc zc.
Then, these strings are decoded into real values and replaced with the jth variable of
the current best optimization variable vector v∗ as follows:

v− = v+ = v∗, v−(j) = d(r_j−), v+(j) = d(r_j+) (3)

Next, compute the cost values J(v−) and J(v+). If J(v−) < J(v+), the direction for the
UDS is set as u(j) = −1; otherwise, u(j) = 1. The better row is saved as r_j∗.
• Step 4. UDS: Depending on the direction u(j), perform addition or subtraction to the
jth row, which is described as

r_j(M) = r_j∗(M) + u(j) (4)

Check whether the new row r_j contributes to a further reduction of the loss function.
If so, the current binary string and the variable are updated as the optimal ones as
follows, and go to Step 4.

r_j∗ = r_j, v∗(j) = d(r_j∗) (5)

Otherwise, go to Step 5.
• Step 5. Save the resultant UDS best string, r∗(M), into the jth row of the current
best matrix.
• Step 6. If i < n, set i = i + 1. Go to Step 3. Otherwise, if the current string length m is
shorter than the prescribed maximal row length mf, set i = 1, increase the row length
index as m = m + 1, and go to Step 2. In the case of m = mf, go to Step 7.
• Step 7. If the number of restarts is less than the specified value, go to Step 1. Otherwise,
terminate the current local search routine and choose the global minimum with the
smallest cost value among the local minima found so far.
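To make the flow of Steps 1–7 concrete, the following Python sketch reproduces the univariate bisectional/unidirectional search loop in a minimal form. It is an illustrative reconstruction under the notation above, not the authors' implementation; the function names (udeas, decode) and the random initialization are assumptions.

```python
import random

def decode(bits, lower, upper):
    """Map a binary row (list of 0/1 bits) to a real value in [lower, upper]."""
    value = int("".join(map(str, bits)), 2)
    return lower + (upper - lower) * value / (2 ** len(bits) - 1)

def udeas(loss, bounds, m0=3, mf=12, n_restart=20):
    """Minimal sketch of the univariate search in Steps 1-7 above.
    `loss` maps a list of reals to a cost; `bounds` is a list of (lower, upper)
    pairs, one per variable. Names and structure are illustrative only."""
    n = len(bounds)
    best_v, best_cost = None, float("inf")
    for _ in range(n_restart):                                   # Step 7: restarts
        rows = [[random.randint(0, 1) for _ in range(m0)] for _ in range(n)]  # Step 1
        v = [decode(rows[j], *bounds[j]) for j in range(n)]
        cost = loss(v)
        for _m in range(m0, mf):                                 # Step 6: grow row length
            for j in range(n):                                   # one variable at a time
                # Step 3 (BSS): append 0 and 1 to the jth row and keep the better one
                candidates = [rows[j] + [bit] for bit in (0, 1)]
                costs = []
                for cand in candidates:
                    trial = list(v)
                    trial[j] = decode(cand, *bounds[j])
                    costs.append(loss(trial))
                k = 0 if costs[0] < costs[1] else 1
                rows[j], cost = candidates[k], costs[k]
                u = -1 if k == 0 else 1                          # UDS direction u(j)
                # Step 4 (UDS): add u(j) to the row while the loss keeps decreasing
                while True:
                    r_int = int("".join(map(str, rows[j])), 2) + u
                    if not 0 <= r_int < 2 ** len(rows[j]):
                        break
                    r_new = [int(b) for b in format(r_int, "0{}b".format(len(rows[j])))]
                    trial = list(v)
                    trial[j] = decode(r_new, *bounds[j])
                    c_new = loss(trial)
                    if c_new < cost:
                        rows[j], cost = r_new, c_new             # Step 5: keep best row
                    else:
                        break
                v[j] = decode(rows[j], *bounds[j])
        if cost < best_cost:                                     # keep the global minimum
            best_v, best_cost = list(v), cost
    return best_v, best_cost
```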
3. Proposed Pose Estimation Algorithm
Pose Estimation Process
Figure 5 shows the overall flow diagram of the proposed pose estimation system.
Figure 5. Flow diagram of the proposed pose estimation process.
• Step 1. Calibration of link length: Our system checks whether the human subject is a
new user or not because the subject’s bone length information is basically necessary
for the model-based pose estimation. If the present system has no link length data for
the current subject, the link length measurement process begins; the subject stands
with the arms stretched down, images are captured for at least 10 frames, and the
length of each bone link is calculated as the average distance between the coordinates
of the end joints of the bone at each frame.
• Step 2. Acquire images from an RGB camera with an image grabber module of SBC.
Although an Intel RealSense camera is used in the present system, commercial RGB
webcams are also available.
• Step 3. Execute MPP and obtain 2D pixel coordinates of the 17 landmarks for the
captured human body.
• Step 4. Execute uDEAS to search for the unknown pose-relevant variables, such as the cam-
era's distance factor and viewing angles, and the intra-body joint angles by reducing
the loss function formulated with the L2 norm between the joint coordinates obtained
with MPP and those reprojected onto the corresponding 2D plane.
• Step 5. Plot the estimated poses in 2D or 3D depending on the application field.
• Step 6. If the current image frame is the last one or a termination condition is met, stop
the pose estimation process. Otherwise, go to Step 2.
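The per-frame loop of Steps 2–6 can be summarized by the following hedged Python sketch. The MediaPipe Pose and OpenCV calls are standard library usage, while fit_fn and draw_fn are hypothetical stand-ins for the uDEAS fitting and plotting stages described above, not APIs from the paper or from MediaPipe.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def run_pose_estimation(video_source, fit_fn, draw_fn):
    """Sketch of the per-frame loop of Steps 2-6. `fit_fn` stands for the
    uDEAS-based fitting stage and `draw_fn` for the plotting stage; both are
    assumptions supplied by the caller."""
    cap = cv2.VideoCapture(video_source)          # Step 2: image acquisition
    prev_solution = None
    with mp_pose.Pose(static_image_mode=False) as pose:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break                             # Step 6: stop at the last frame
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            result = pose.process(rgb)            # Step 3: MPP 2D landmarks
            if result.pose_landmarks is None:
                continue
            landmarks_2d = [(lm.x, lm.y) for lm in result.pose_landmarks.landmark]
            # Step 4: fit the pose-relevant variables to the 2D landmarks
            prev_solution = fit_fn(landmarks_2d, prev_solution)
            draw_fn(frame, prev_solution)         # Step 5: plot the estimated pose
    cap.release()
```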
For an estimation of arbitrary poses at each frame, the size factor, γ, and the three
body angle values related to the camera view angle, such as θbd , φbd , and ψbd mentioned
in Section 2.3, are added to the list of optimization variables. Therefore, a complete
optimization vector for pose estimation consisting of 19 variables is estimated as follows:
V = [γ, θbd, φbd, ψbd, θtr, φtr, ψtr, θhp^l, θkn^l, θhp^r, θkn^r, θsh^l, θel^l, θsh^r, θel^r, φhp^l, φhp^r, φsh^l, φsh^r]^T (6)
where (xc^CoM, yc^CoM) is the coordinate of the humanoid's CoM projected onto the floor;
(xc^(i,ft), yc^(i,ft)), i = l, r, denotes the center of the floor coordinates of the left and right feet;
and lleg denotes the length of the leg attained by summing the three links in the leg, i.e.,
lleg = lfe + ltb + lca, which is necessary for normalization.
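A plausible implementation of the normalized CoM-deviation term described here is sketched below, assuming the term is the planar distance from the projected CoM to the center of the supporting foot (feet), divided by lleg; the exact formula used in the paper may differ.

```python
import math

def com_deviation(com_xy, left_foot_xy, right_foot_xy, l_leg, stance="double"):
    """Sketch of the CoM-deviation term (assumption): Euclidean distance from the
    floor-projected CoM to the support center, normalized by l_leg = lfe + ltb + lca."""
    if stance == "double":
        cx = 0.5 * (left_foot_xy[0] + right_foot_xy[0])
        cy = 0.5 * (left_foot_xy[1] + right_foot_xy[1])
    elif stance == "left":
        cx, cy = left_foot_xy
    else:
        cx, cy = right_foot_xy
    return math.hypot(com_xy[0] - cx, com_xy[1] - cy) / l_leg
```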
Figure 6. Pose comparison in the 2D frontal plane between the MPP pixel model (solid line) and the
fitted model (dashed line) contracted from the initial humanoid model (dash-dotted line).
As the third metric of loss function, penalty functions concerning the suitable bounds
of some joint angles are newly proposed to make it possible to find the best fit among
several identical 3D poses for an estimated MPP pose. Figure 7 shows three types of penalty
functions for joint angles. The single-sided negative (positive) penalty function Psn (Psp)
and the double-sided penalty function Pd are defined as follows:

Psn(θ, σn) = |θ − σn| if θ < σn; 0 if θ ≥ σn (9)

Psp(θ, σp) = θ − σp if θ > σp; 0 if θ ≤ σp (10)

Pd(θ, σn, σp) = |θ − σn| if θ < σn; θ − σp if θ > σp; 0 if σn ≤ θ ≤ σp (11)

[Figure 7: the single-sided penalty functions Psn and Psp and the double-sided penalty function Pd plotted against θ, with thresholds σn and σp.]
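The three penalty functions in Equations (9)–(11) translate directly into code; a minimal Python sketch is given below (angles and thresholds in degrees).

```python
def p_sn(theta, sigma_n):
    """Single-sided negative penalty, Equation (9)."""
    return abs(theta - sigma_n) if theta < sigma_n else 0.0

def p_sp(theta, sigma_p):
    """Single-sided positive penalty, Equation (10)."""
    return theta - sigma_p if theta > sigma_p else 0.0

def p_d(theta, sigma_n, sigma_p):
    """Double-sided penalty, Equation (11)."""
    if theta < sigma_n:
        return abs(theta - sigma_n)
    if theta > sigma_p:
        return theta - sigma_p
    return 0.0
```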
Figure 8. Human pose recognition system.
The basic configuration of uDEAS is determined as follows:
• Number of optimization variables: 19.
• Initial row length: 3.
• Maximum row length: 12.
• Number of maximum restarts: 20.
For pose estimation with a single image or an initial image of a video, the search
ranges of each variable need to be determined appropriately, i.e., not too wide for search
efficiency and not too narrow for inclusion of global minima. Table 1 summarizes the
upper and lower bounds of the optimization variables in which all the joint variables can
generate six ADL poses. In pose estimation with simulated poses where the ground-truth
joint angles are known, the size factor is set as 1.0. Among the joint variables, the knee
and elbow joints must have only positive angles, and, thus, their lower bounds are all set
as zero.
For pose estimation with a sequence of images, the search ranges are adjusted around
the optimal variables found by uDEAS in the previous image frame. In this paper, the upper
limit of the optimization variable was determined by adding 10 degrees to the optimal
variable of the previous frame, and the lower limit was set by subtracting 10 degrees from
this. In this case, when the minimum rotation angle must be 0 degrees, such as the elbow
or knee joint, care must be taken to ensure that the adjusted search's lower limit does not
fall below 0 degrees.

Table 1. Upper and lower bounds of the optimization variables (in degrees).

Variable: γ, θbd, φbd, ψbd, θtr, φtr, ψtr, θhp^l, θkn^l, θhp^r, θkn^r, θsh^l, θel^l, θsh^r, θel^r, φhp^l, φhp^r, φsh^l, φsh^r
Upper bound: 1, 10, 10, 90, 90, 40, 30, 90, 90, 90, 90, 180, 90, 180, 90, 40, 40, 40, 40
Lower bound: 1, −10, −10, −90, −20, −40, −30, −20, 0, −20, 0, −180, 0, −180, 0, −40, −40, −40, −40
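A minimal sketch of this per-frame search-range update is shown below, assuming a window of plus or minus 10 degrees around the previous optimum and a zero lower clamp for the knee and elbow joints; the joint-name strings are illustrative, not taken from the paper.

```python
def adjust_bounds(prev_opt, joint_names, delta=10.0,
                  nonneg=("kn_l", "kn_r", "el_l", "el_r")):
    """Per-frame search-range update: +/- delta degrees around the previous
    optimum, with the lower limit clamped at 0 for knee/elbow joints."""
    bounds = []
    for name, value in zip(joint_names, prev_opt):
        lo, hi = value - delta, value + delta
        if name in nonneg:
            lo = max(lo, 0.0)      # keep the adjusted lower limit non-negative
        bounds.append((lo, hi))
    return bounds
```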
4.1. Pose Estimation with Simulation Data
For the performance validation of the proposed approach, we generated six poses:
(1) stand and raise arms front; (2) stand and raise arms up; (3) walk with the left leg in
front; (4) bend down and grab an object; (5) sit on a chair; and (6) kneel down.
Figure 9 shows that the loss function profiles gradually minimize during 20 restarts
for a given pose. It is apparent that the loss functions converge when the row length of the
uDEAS matrices reaches 10.
[Figure 9: current best cost (log scale) plotted against row length (3 to 12), with one curve per restart.]
Figure 9. Minimization aspect of loss function during 20 restarts of uDEAS with each restart
colored differently.
Figure 10 shows that uDEAS successfully estimates the six poses under the current
optimization configuration. Table 2 lists the MPJPE of each global minimum in Equation (7),
the average angular difference between the 18 ground-truth joint angles in Equation (6),
and the estimated ones for the six poses. The average MPJPE is 0.097 m, and the average
angle difference per joint is 10.017 degrees, which is an acceptable result for pose estimation
with a full-scale humanoid model. The body size was selected referring to average Korean
women in their twenties [34]. It is worth noting that all poses were created with the body
transverse angles (ψbd) between 20 and 40 degrees for generalization, which indicated that
the camera view angle was set to the side of the subject.

Table 2. MPJPE and average angular difference values for the best-fitted poses in Figure 10.

Pose: 1, 2, 3, 4, 5, 6, Avg.
MPJPE (m): 0.0055, 0.0099, 0.0111, 0.0049, 0.0150, 0.0116, 0.097
Avg. ang. diff (deg): 6.061, 7.748, 10.557, 5.6, 14.558, 15.58, 10.017
Figure 10. Pose estimation results for the six representative poses generated by the humanoid model
(solid line: true pose, dotted line: estimated pose, red line: right parts, blue line: left parts).
The loss function needs to reflect the pose match, pose stability, pose symmetry, and
penalty values for the joint angles of the torso and both shoulders as follows:

L(v) = MJCD(v) + γCoM CoMD(v) + γsym (|θhp^l − θhp^r| + |θkn^l − θkn^r|) + γsag_tr Psn(θtr, −10°)
+ γcor_tr Pd(φtr, −10°, 10°) + γtran_tr Pd(ψtr, −15°, 15°) + γsag_sh_l Psn(θsh^l, −10°)
+ γsag_sh_r Psn(θsh^r, −10°) + γcor_sh_l Psp(φsh^l, 20°) + γcor_sh_r Psp(φsh^r, 20°)
where the weights γCoM and γcor_sh_r denote the effects of each term on the loss function to
reflect the adequacy of the reconstructed 3D pose. Because the MPJPE was in the order of
10−3 , these weights needed to be 0.01. The threshold values of the penalty functions can be
selected appropriately for a target pose and activity.
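The loss assembly above can be mirrored in code roughly as follows, reusing the penalty helpers p_sn, p_sp, and p_d sketched after Equation (11); the dictionary keys and weight names are assumptions, and MJCD and CoMD are taken as precomputed values.

```python
def total_loss(v, mjcd, com_dev, weights):
    """Sketch of the loss assembly above. `v` is a dict of joint angles (degrees),
    `mjcd` and `com_dev` are precomputed; assumes p_sn, p_sp, p_d are in scope."""
    sym = abs(v["hp_l"] - v["hp_r"]) + abs(v["kn_l"] - v["kn_r"])
    return (mjcd
            + weights["com"] * com_dev
            + weights["sym"] * sym
            + weights["sag_tr"] * p_sn(v["tr_sag"], -10.0)
            + weights["cor_tr"] * p_d(v["tr_cor"], -10.0, 10.0)
            + weights["tran_tr"] * p_d(v["tr_tran"], -15.0, 15.0)
            + weights["sag_sh_l"] * p_sn(v["sh_sag_l"], -10.0)
            + weights["sag_sh_r"] * p_sn(v["sh_sag_r"], -10.0)
            + weights["cor_sh_l"] * p_sp(v["sh_cor_l"], 20.0)
            + weights["cor_sh_r"] * p_sp(v["sh_cor_r"], 20.0))
```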
Figure 11. Pose estimation results for three poses generated by the humanoid model. (a) Original
images and MPP results overlaid, (b) comparison between the MPP model (solid line) and the fitted
humanoid model (dotted line) in 2D plane, and (c) the humanoid model reconstructed in 3D plane
(red line: right parts, blue line: left parts).
Figure 12 shows the trajectories of the estimated joint angles for the poses from
Figure 11a–c. It is noteworthy that the sagittal hip and knee joints move from 0 to around
100 degrees, and the sagittal torso angles change from 0 to 60 degrees, which match the
actual human joint angles rather well. These angle trajectories also provide information on
the current stance, making them helpful for medical or therapeutic purposes.
Figure 12. Trajectories of the estimated joint angles (degree) at (a) the torso and (b) the sagittal plane.
For real-time application of the proposed pose optimization system, the execution time
needs to be measured while running uDEAS on NUC. Table 3 lists the mean run time
per frame measured while changing the number of restarts and the maximum row length
of uDEAS. Interestingly, the optimal row length is found to be six when number of restarts
is set at 10 because the loss of the next row length (7.22 × 10−3) increases significantly.
In the same manner, the optimal number of restarts is 6 when the optimal row length is 12.
As a combination of these results, the loss and mean run time per frame are 7.04 × 10−3
and 0.033 s, respectively, in the case when the number of restarts and the maximum row
length are both six. Since this optimization execution time is below 100 ms, i.e., camera-
capturing time for each frame at 10 fps, it is likely that real-time pose estimation is possible
with the proposed system.

Table 3. Comparison of uDEAS run time measured in NUC (bold: optimal configuration).
No. Restart | Max. Row Length | Loss (×10−3) | Avg. Run Time Per Frame (s)
10 | 12 | 6.52 | 0.180
10 | 11 | 6.63 | 0.165
10 | 10 | 6.63 | 0.137
10 | 9 | 6.54 | 0.118
10 | 8 | 6.74 | 0.096
10 | 7 | 6.72 | 0.078
10 | 6 | 6.94 | 0.062
10 | 5 | 7.22 | 0.044
10 | 4 | 12.89 | 0.028
12 | 9 | 6.65 | 0.170
12 | 8 | 6.73 | 0.149
12 | 7 | 6.68 | 0.130
12 | 6 | 6.98 | 0.113
12 | 5 | 7.15 | 0.096
12 | 4 | 7.98 | 0.079
6 | 6 | 7.04 | 0.033 (optimal configuration)

Figure 13 shows various 2D poses and the corresponding 3D poses reconstructed
from the 2D poses using the camera images and MPP. It can be seen that the reconstructed
humanoid models are similar to the actual 3D poses owing to the loss function that reflects
many conditions related to the stable poses.
To verify the activity estimation performance, a sudden fall motion estimation was
attempted, as shown in Figures 14 and 15. The images captured in Figure 14 and the poses
of the humanoid model in Figure 15 match well, and the estimated joint angle trajectories
plotted in Figure 16 have apparently abnormal features. Specifically, the fact that the hip
and knee joint angles in the sagittal plane change suddenly from 0 degrees (standing) to 50
(hip) and 120 (knee) degrees means that the subject is instantly folding their right leg. As
such, other poses can be recognized based on the representative joint angle profiles using
the proposed approach.

Figure 14. Results of 2D pose estimation obtained by MPP for a sequence of images in a sudden
falling case (red line: right parts, blue line: left parts).
Figure 15. Results of the reconstructed 3D humanoid poses corresponding to Figure 14 (red line:
right parts, blue line: left parts).
Figure 16. Angular trajectories (in degrees) of the shoulder, elbow, hip, and knee joints in the
sagittal plane.
Actually, MPP estimates the depth data for each landmark as well [22]. As a validation
of the depth estimation performance of MPP, Figure 17 compares two 3D standing poses
estimated by the proposed method and MPP. As shown in the figure, the depth data of
each joint estimated by the current version of MPP have large errors when compared with
our results. The ratio of the 2D MPJPE and 3D MPJPE is 185.37 for the standing pose and
277.81 for the arm-raised pose. The latter pose is more twisted than the former one from
the camera's viewpoint, and thus the depth error leads to a more different pose. This result
is why we have employed only 2D landmark information for recovering the 3D human
pose in the present work. Even if the depth value is estimated almost exactly in a later
version of MPP, our method can be applied as it is, and the 3D estimation accuracy is
expected to be higher.
Figure 17. Comparison of 3D poses obtained from MPP (solid) and the proposed approach (dotted).
(a) Standing pose, (b) arm-raised pose (red line: right parts, blue line: left parts).
5. Conclusions
In this paper, we present a 3D human pose estimation system from monocular images
and videos by taking 2D skeletal poses estimated by the off-the-shelf deep learning method,
MPP, as the input and fitting the 3D humanoid robot model to the 2D model at the joint
angle level through reprojection, using the fast optimization method, uDEAS. Recently, most
pose estimation methods have been developed using deep neural networks and, thus, require
high-performance PCs or SBCs with many GPUs, which limits their application to mobile
robot systems because of rapid heating issues and purchasing difficulties caused by
semiconductor supply shortages. In order to improve the pose estimation performance,
we elaborated our full-body humanoid robot model by adding three joints at the root joint
and added a loss function CoM deviation term and penalty functions as constraints in the
joint angle ranges for pose balance. Adopting the CoM concept is a novel idea in the area of
pose estimation. With these efforts, the optimization execution time per frame is measured
at 0.033 s on a NUC without GPU, showing the feasibility of a real-time system.
To validate the proposed approach, we generated 3D simulation data for six ADL poses
and compared them with the poses estimated by uDEAS. The mean MPJPE was 0.097 m,
and the average angle difference per joint was 10.017 degrees, which is an acceptable result
for pose estimation. The execution time of uDEAS was measured as 0.033 s in the case
when the number of restarts and the maximum row length were both six, which was below
the camera-capturing time of 0.1 s (10 fps); thus, it is likely that real-time pose estimation is
possible with the proposed system. In the experiment with the proposed system, a standing
to squatting activity, several whole-body exercises, and a dangerous activity of falling
were captured on video, and each frame was input into the proposed system. The results
show that very fast and drastic changes occur in the angular trajectories of the shoulder,
elbow, hip, and knee joints, providing a lot of information for activity recognition. In future
work, the proposed pose estimation system may be applied to analyze the activities of
construction workers and to monitor patients with Parkinson’s disease to build a database
of joint angles for human motions in target areas. It is expected that timely awareness of
abnormal or dangerous activities will be possible based on direct joint angle information.
In addition, the present approach, which uses no deep learning model or dataset, can
complement deep learning-based methods in analyzing and recognizing arbitrary
ADL poses.
Author Contributions: Software, J.-W.K., J.-Y.C. and E.-J.H.; Data curation, E.-J.H.; Writing—original
draft, J.-W.K.; Funding acquisition, J.-H.C. All authors have read and agreed to the published version
of the manuscript.
Funding: This research was funded by the National Research Foundation of Korea (NRF) with a
grant funded by the Korea government (MSIT) (No. NRF-2021R1A4A1022059) and by the NRF grant
funded by the MSIT (2020R1A2C1014649).
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: No new data were created or analyzed in this study. Data sharing is
not applicable to this article.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Su, M.; Hayati, D.W.; Tseng, S.; Chen, J.; Wei, H. Smart Care Using a DNN-Based Approach for Activities of Daily Living (ADL)
Recognition. Appl. Sci. 2020, 11, 10. [CrossRef]
2. Noreils, F.R. Inverse kinematics for a Humanoid Robot: A mix between closed form and geometric solutions. Tech. Rep. 2017, 1–31.
[CrossRef]
3. Yu, Y.; Yang, X.; Li, H.; Luo, X.; Guo, H.; Fang, Q. Joint-level vision-based ergonomic assessment tool for construction workers.
J. Constr. Eng. Manag. 2019, 145, 04019025. [CrossRef]
4. Rokbani, N.; Casals, A.; Alimi, A.M. IK-FA, a new heuristic inverse kinematics solver using firefly algorithm. Comput. Intell. Appl.
Model. Control 2015, 369–395. [CrossRef]
5. Xu, J.; Yu, Z.; Ni, B.; Yang, J.; Yang, X.; Zhang, W. Deep kinematics analysis for monocular 3d human pose estimation. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 899–908.
6. Li, J.; Xu, C.; Chen, Z.; Bian, S.; Yang, L.; Lu, C. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose
and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN,
USA, 19–25 June 2021; pp. 3383–3393.
7. Sarafianos, S.; Boteanu, B.; Ionescu, B.; Kakadiaris, I.A. 3D human pose estimation: A review of the literature and analysis of
covariates. Comput. Vis. Image Underst. 2016, 152, 1–20. [CrossRef]
8. Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image
Underst. 2020, 192, 102897. [CrossRef]
9. Wang, J.; Tan, S.; Zhen, X.; Xu, S.; Zheng, F.; He, Z.; Shao, L. Deep 3D human pose estimation: A review. Comput. Vis. Image
Underst. 2021, 210, 103225. [CrossRef]
10. Yurtsever, M.M.E.; Eken, S. BabyPose: Real-time decoding of baby’s non-verbal communication using 2D video-based pose
estimation. IEEE Sens. 2022, 22, 13776–13784. [CrossRef]
11. Alam, E.; Sufian, A.; Dutta, P.; Leo, M. Vision-based human fall detection systems using deep learning: A review. Comput. Biol.
Med. 2022, 146, 105626. [CrossRef]
12. Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Coarse-to-fine volumetric prediction for single-image 3D human pose.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017;
pp. 7025–7034.
13. Luvizon, D.C.; Picard, D.; Tabia, H. 2d/3d pose estimation and action recognition using multitask deep learning. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5137–5146.
14. Li, S.; Chan, A.B. 3d human pose estimation from monocular images with deep convolutional neural network. In Proceedings of
the Asian Conference on Computer Vision, Singapore, 1–5 November 2014; pp. 332–347.
15. Zhou, X.; Sun, X.; Zhang, W.; Liang, S.; Wei, Y. Deep kinematic pose regression. In Proceedings of the European Conference on
Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 186–201.
16. Tome, D.; Russell, C.; Agapito, L. Lifting from the deep: Convolutional 3d pose estimation from a single image. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2500–2509.
17. Wang, J.; Huang, S.; Wang, X.; Tao, D. Not all parts are created equal: 3D pose estimation by modelling bi-directional dependencies
of body parts. arXiv 2019, arXiv:1905.07862.
18. Wandt, B.; Rosenhahn, B. Repnet: Weakly supervised training of an adversarial reprojection network for 3d human pose
estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA,
16–20 June 2019; pp. 7782–7791.
19. Sigal, L.; Balan, A.O.; Black, M.J. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for
evaluation of articulated human motion. IJCV 2010, 87, 4–27. [CrossRef]
20. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6m: Large scale datasets and predictive methods for 3d human
sensing in natural environments. TPAMI 2014, 36, 1325–1339. [CrossRef] [PubMed]
21. Pavllo, D.; Feichtenhofer, C.; Grangier, D.; Auli, M. 3D human pose estimation in video with temporal convolutions and semi-
supervised training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA,
16–20 June 2019; pp. 7753–7762.
22. MediaPipe Pose. Available online: https://google.github.io/mediapipe/solutions/pose.html (accessed on 28 December 2021).
23. Kim, J.-W.; Kim, T.; Park, Y.; Kim, S.W. On load motor parameter identification using univariate dynamic encoding algorithm for
searches (uDEAS). IEEE Trans. Energy Convers. 2008, 23, 804–813.
24. Vicon. Available online: https://www.vicon.com/ (accessed on 1 August 2021).
25. Vakanski, A.; Jun, H.P.; Paul, D.; Baker, R. A data set of human body movements for physical rehabilitation exercises. Data 2018,
3, 2. [CrossRef] [PubMed]
26. Bazarevsky, V.; Grishchenko, I. On-Device, Real-Time Body Pose Tracking with MediaPipe BlazePose, Google Research. Available
online: https://ai.googleblog.com/2020/08/on-device-real-time-body-pose-tracking.html (accessed on 10 August 2021).
27. Denavit, J.; Hartenberg, R.S. A kinematic notation for lower-pair mechanisms based on matrices. J. Appl. Mech. 1955, 77, 215–221.
[CrossRef]
28. Kim, J.-W.; Tran, T.T.; Dang, C.V.; Kang, B. Motion and walking stabilization of humanoids using sensory reflex control. Int. J.
Adv. Robot. Syst. 2016, 13, 1–10.
29. Kim, J.-W.; Kim, T.; Choi, J.-Y.; Kim, S.W. On the global convergence of univariate dynamic encoding algorithm for searches
(uDEAS). Int. J. Control Autom. Syst. 2008, 6, 571–582.
30. Yun, J.P.; Choi, S.; Kim, J.-W.; Kim, S.W. Automatic detection of cracks in raw steel block using Gabor filter optimized by univariate
dynamic encoding algorithm for searches (uDEAS). NDT E Int. 2009, 42, 389–397. [CrossRef]
31. Kim, E.; Kim, M.; Kim, S.-W.; Kim, J.-W. Trajectory generation schemes for bipedal ascending and descending stairs using
univariate dynamic encoding algorithm for searches (uDEAS). Int. J. Control Autom. Syst. 2010, 8, 1061–1071. [CrossRef]
32. Kim, J.-W.; Ahn, H.; Seo, H.C.; Lee, S.C. Optimization of Solar/Fuel Cell Hybrid Energy System Using the Combinatorial Dynamic
Encoding Algorithm for Searches (cDEAS). Energies 2022, 15, 2779. [CrossRef]
33. Goldberg, D.E. Genetic Algorithms in Search, Optimization and Machine Learning; Addison Wesley: Berkeley, CA, USA, 1999.
34. Size Korea. Available online: https://sizekorea.kr (accessed on 15 March 2022).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.