Human Pose Estimation Using MediaPipe Pose and
Optimization Method Based on a Humanoid Model
Jong-Wook Kim, Jin-Young Choi, Eun-Ju Ha and Jae-Ho Choi *

Department of Electronics Engineering, Seunghak Campus, Dong-A University, Busan 49315, Republic of Korea
* Correspondence: [email protected]

Abstract: Seniors who live alone at home are at risk of falling and injuring themselves and, thus,
may need a mobile robot that monitors and recognizes their poses automatically. Even though deep
learning methods are actively evolving in this area, they have limitations in estimating poses that are
absent or rare in training datasets. For a lightweight approach, an off-the-shelf 2D pose estimation
method, a more sophisticated humanoid model, and a fast optimization method are combined to
estimate joint angles for 3D pose estimation. As a novel idea, the depth ambiguity problem of 3D
pose estimation is solved by adding a loss function term for the deviation of the center of mass from the center of
the supporting feet and penalty functions concerning appropriate joint angle rotation ranges. To verify
the proposed pose estimation method, six daily poses were estimated with a mean joint coordinate
difference of 0.097 m and an average angle difference per joint of 10.017 degrees. In addition, to
confirm practicality, videos of exercise activities and a scene of a person falling were filmed, and the
joint angle trajectories were produced as the 3D estimation results. The optimized execution time
per frame was measured at 0.033 s on a single-board computer (SBC) without GPU, showing the
feasibility of the proposed method as a real-time system.

Keywords: human pose estimation; humanoid robot; global optimization method; MediaPipe
Pose; uDEAS

Citation: Kim, J.-W.; Choi, J.-Y.; Ha, E.-J.; Choi, J.-H. Human Pose Estimation Using MediaPipe Pose and Optimization Method Based on a Humanoid Model. Appl. Sci. 2023, 13, 2700. https://doi.org/10.3390/app13042700
Academic Editor: Alessandro Di Nuovo
Received: 4 January 2023; Revised: 10 February 2023; Accepted: 15 February 2023; Published: 20 February 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction
Due to improvements in medical technology and appropriate nutrition, the senior population is continuously growing all over the world. Seniors are monitored by health managers or caregivers in nursing facilities, homes, or hospitals. Seniors who live alone at home are at risk of falling and injuring themselves unless a family member, a social worker, or a caregiver is with them. A mobile robot that moves around the house, takes pictures of an elderly person's pose from appropriate positions, and automatically analyzes their current pose or activity to alert relevant people when a dangerous situation or issue arises would be very useful. To this end, full-body joint angle data for Activities of Daily Living (ADL) are very efficient information for recognition, transmission to a server, and restoration to a DB as historical data [1].
To estimate joint angles, the classical approach is to solve inverse kinematics (IK) with a given set of 3D joint coordinates of a humanoid robot whose structure mimics the human body. However, there are no closed-form IK equations covering all the joints of such complex robots; instead, IK equations are derived separately for six parts: two arms, two legs, the torso, and the head [2]. In [1,3], joint angles are calculated for each part using a geometric relationship, which is difficult to apply to the full variety of poses. On the other hand, a heuristic optimization method called the "Firefly Algorithm" has been applied to solve IK equations for a three-link articulated planar system [4]. Recently, attempts to replace inverse kinematics formulae for human models with deep learning methods have been actively pursued [5,6].
Furthermore, motion capture systems, which are quite expensive and take up a lot of space, are the only way to obtain precise 3D joint coordinates for a subject. Therefore,
the authors decided that the most realistic and simplest way would be to photograph a
subject with a 2D camera and estimate each joint angle using a fast optimization algorithm
based on a 3D humanoid model. This technology belongs to the field of 3D human pose
estimation with monocular image or video.
Human pose estimation technology is being actively researched around the world
in the areas of sports, surveillance, work monitoring, home elderly care, home training,
entertainment, gesture control, and even metaverse avatar. In general, human pose esti-
mation is classified into 2D and 3D coordinate estimation methods, single-person- and
multiple-person-based methods according to the number of target subjects, monocular
image- and multi-view image-based methods according to the number of shooting cameras,
and single-image- and video-based methods according to the input type [7–11].
Specifically, according to the structure of the deep learning process, human pose estimation
is classified into single-stage methods and two-stage methods. The single-stage methods
that directly map input images to 3D body joint coordinates can be categorized into
two classes: detection-based methods [12,13] and regression-based methods [14,15]. The
detection-based methods predict a likelihood heatmap for each joint, whose location is
determined by taking the maximum likelihood of the heatmap, while the regression-based
methods directly estimate the location of joints relative to the root joint location [14] or the
angle of joints by introducing a kinematic model consisting of several joints and bones [15].
Because 2D pose estimation has a greater number of in-the-wild datasets with ground-
truth joint coordinates than 3D pose estimation, two-stage methods of leveraging 2D pose
estimation findings for 3D human pose estimation, also known as lifting from 2D to 3D, are
being developed extensively [16]. The relationships between joints have been exploited
by long short-term memory (LSTM) [17], and generative adversarial networks (GANs) are
often used to produce a more realistic 3D human pose [18].
In spite of continuous technological advances, the deep learning methods of 3D pose
estimation from 2D images should solve challenging problems, including a lack of in-the-
wild datasets, a huge demand for various posture data, depth ambiguities, and a large
searching state space for each joint [9]. Furthermore, a high-performance PC equipped
with many GPUs is essential for executing deep learning packages.
Firstly, collecting a large in-the-wild dataset with 3D annotation is very intensive,
and thus building the popular datasets of HumanEva [19] and Human3.6M [20] requires
expensive motion capture systems and many subjects and experiments. Secondly, human
pose has an infinite number of variants according to camera translation, body orientation,
differences in height and body part ratio, etc. Thirdly, the depth ambiguity problem
arises because different 3D poses can be mapped to a single 2D pose, which is known
to be mitigated using temporal information from a series of images or multi-viewpoint
images [21]. Lastly, the requirement for an at least 17-dimensional joint space is also a high-
order problem to solve with conventional optimization methods, and thus the optimization
tends to be time consuming and insufficiently precise.
In this paper, to run a human pose estimation package on an SBC installed in a
mobile robot, a new type of two-stage pose estimation method is proposed. The first stage
of 2D pose estimation is performed with MediaPipe Pose [22], and the second stage of
estimating joint angles is carried out with a fast optimization method, uDEAS [23] based
on an elaborate humanoid model. We propose a 3D full-body humanoid model whose
reference coordinate frame is located at the center of the pelvis, i.e., root joint, and three DoF
(Degree of Freedom) lumbar joints are newly added to the center. The three lumbar joints
of twist, flexion, and lateral flexion can make poses in which only the upper body rotates or
bends, and thus they are indispensable joints necessary to create various natural poses, such
as yoga, sitting, and lying, amongst others. However, some recent deep learning methods
lack these core joints when modeling the human body [5,6]. In addition, the joint rotation
polarity rules for all the humanoid joints are designed to be consistent with those of the
Vicon motion capture system [24], which bridges numerous physiological research results
on various activities of the human body using Vicon data [25]. An innovative method for
Appl. Sci. 2023, 13, 2700 3 of 21

resolving the inverse kinematics of the humanoid is to use uDEAS to tune 19 unknown
pose-relevant variables for each frame of real-time pose estimation with camera-based or
video-based images. This allows the humanoid joint angles to fit the 2D humanoid model
that is reprojected from the 3D model to the MPP skeleton as closely as possible.
The proposed approach avoids the first problem because it does not require a significant amount of
human pose data, and a full set of optimization variables is constructed to resolve the
second problem of pose variation. The third problem of depth ambiguity can be addressed
by adding the deviation of the center of mass from the center of the supporting feet as well as
appropriate penalty functions concerning the allowed range of joint angle rotation. The
fourth problem can be overcome by employing a fast optimization method, such as uDEAS.
For the validation of the proposed approach, several ADL poses were attained by
simulation and experiment. We generated simulations of human poses using a humanoid
model and given joint angles as ground-truth data, and we allowed uDEAS to estimate the
true joint angles by taking simulated poses as the input. In order to check for practicality,
gymnastics motion and sudden fall motion were filmed with a camera. The 3D pose
estimation results and the obtained joint trajectories were acceptable for application to
mobile robots that monitor poses. Unfortunately, most state-of-the-art deep learning
methods require CUDA-relevant libraries and GPU hardware, and we could not apply
them to our small mobile robot.
The contributions of this paper are presented below:
• In order to simulate and estimate a human-like pose, a full-body humanoid robot
model with lumbar joints was constructed including effects of camera view angle
and distance.
• Instead of solving the inverse kinematics of a humanoid for a given 2D skeletal model,
the heuristic optimization method uDEAS directly adjusts the camera-relative body
angles and intra-body joint angles to match the 2D projected humanoid model to the
2D skeletal model.
• The depth ambiguity problem can be solved by adding a loss function deviation of
center of mass from the center of the supporting foot (feet) and appropriate penalty
functions for the ranges of natural joint angle rotations.
• The proposed 3D human body pose estimation system showed an average perfor-
mance of 0.033 s per frame using an inexpensive SBC without a GPU.
• We find that rare poses resulting from falling activity were well estimated in the
present system. This may be difficult with deep learning methods due to the lack of
training data.
This paper is organized into five sections. Section 2 briefly describes the methods
that comprise the proposed pose estimation system. Section 3 explains the structure of
the proposed system, and Section 4 describes the experimental results when applying our
system to several representative poses. Section 5 concludes the present work and discusses
future work.

2. Pose Estimation Approach


2.1. MediaPipe Pose
In this paper, MediaPipe Pose (MPP), an open-source cross-platform framework
provided by Google, was employed to attain estimates of 2D human joint coordinates in
each image frame. MediaPipe Pose builds pipelines and processes cognitive data in the
form of video using machine learning (ML). MPP uses BlazePose [26], which extracts 33 2D
landmarks on the human body as shown in Figure 1. BlazePose is a lightweight machine
learning architecture that achieves real-time performance on mobile phones and PCs with
CPU inference. When using normalized coordinates for pose estimation, the inverse ratio
should be multiplied by the y-axis pixel values. Among the estimated MPP landmarks, we
used 12 landmarks to estimate arbitrary poses and motions, whose indices are 11, 12, 13, 14,
15, 16, 23, 24, 25, 26, 27, and 28, as shown in Figure 1.
Figure 1. Definition of landmarks in MediaPipe Pose [22].
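As a concrete illustration of this first stage, the following minimal Python sketch extracts the 12 landmarks listed above from a single frame with MediaPipe Pose and rescales them to pixel coordinates. It is not the authors' code: the landmark indices come from Figure 1, while the camera source and any downstream processing are placeholders.

```python
import cv2
import mediapipe as mp

# Indices of the 12 landmarks used in this work (shoulders, elbows, wrists,
# hips, knees, ankles), following the BlazePose numbering in Figure 1.
USED_LANDMARKS = [11, 12, 13, 14, 15, 16, 23, 24, 25, 26, 27, 28]

mp_pose = mp.solutions.pose

def extract_2d_landmarks(frame_bgr):
    """Run MediaPipe Pose on one BGR frame and return {index: (x_px, y_px)}."""
    h, w = frame_bgr.shape[:2]
    # For video streams, one Pose instance should be reused across frames.
    with mp_pose.Pose(static_image_mode=False, model_complexity=1) as pose:
        result = pose.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks is None:
        return None
    pts = {}
    for idx in USED_LANDMARKS:
        lm = result.pose_landmarks.landmark[idx]
        # MPP returns coordinates normalized to [0, 1]; rescale to pixels.
        pts[idx] = (lm.x * w, lm.y * h)
    return pts

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)  # placeholder camera source
    ok, frame = cap.read()
    if ok:
        print(extract_2d_landmarks(frame))
    cap.release()
```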

2.2. Humanoid Robot Model
The human body must be represented using a humanoid robot model that mimics the organization of human-like links and joints in order to reconstruct 3D human poses from 2D joint data collected from MPP. Therefore, arbitrary 3D poses can be reconstructed from 2D images taken at arbitrary viewing angles and distances from a camera by measuring link lengths in pixels and estimating joint angles of the humanoid model using an optimization method.
In general, humanoid robots are described with links and joints based on the Denavit–Hartenberg (DH) method [27], in which a reference coordinate frame is placed on the supporting foot. Since the goal of the current approach is to generate and estimate poses of the humanoid that are as human-like as possible, we improved our previous humanoid model [28] to generate arbitrary poses as follows:
• Locating the origin of the reference frame at the center of the body, i.e., the root joint, to create arbitrary poses.
• Adding three-DoF lumbar spine joints at the center of the pelvis to create poses where only the upper body moves separately.
• Redefining the rotational polarity of all joint variables to match the Vicon motion capture system for better interoperability of the joint data measured by the system.
As shown in Figure 2, the proposed humanoid model is composed of a total of 23 joint variables. That is, it is composed of 12 joint angles whose rotation axes are perpendicular to the sagittal plane (θhd, θsh^{l,r}, θel^{l,r}, θtr, θhp^{l,r}, θkn^{l,r}, θan^{l,r}), 7 joint angles whose rotation axes are perpendicular to the frontal plane (φsh^{l,r}, φtr, φhp^{l,r}, φan^{l,r}), and 4 joint angles whose rotation axes are perpendicular to the transverse plane (ψhd, ψtr, ψhp^{l,r}), where the subscripts hd, sh, el, tr, hp, kn, and an indicate the joints of the head, shoulder, elbow, torso, hip, knee, and ankle, respectively, and the superscripts l and r denote the left and right parts, respectively.
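To make the bookkeeping of these 23 joint variables concrete, the short Python sketch below groups them by anatomical plane exactly as enumerated above; the identifier names are illustrative and are not taken from the authors' implementation.

```python
# 23 joint variables of the humanoid model, grouped by the plane their
# rotation axes are perpendicular to (suffixes _l/_r: left/right side).
SAGITTAL = [  # 12 angles (theta)
    "theta_hd", "theta_sh_l", "theta_sh_r", "theta_el_l", "theta_el_r",
    "theta_tr", "theta_hp_l", "theta_hp_r", "theta_kn_l", "theta_kn_r",
    "theta_an_l", "theta_an_r",
]
FRONTAL = [  # 7 angles (phi)
    "phi_sh_l", "phi_sh_r", "phi_tr", "phi_hp_l", "phi_hp_r",
    "phi_an_l", "phi_an_r",
]
TRANSVERSE = [  # 4 angles (psi)
    "psi_hd", "psi_tr", "psi_hp_l", "psi_hp_r",
]

ALL_JOINTS = SAGITTAL + FRONTAL + TRANSVERSE
assert len(ALL_JOINTS) == 23
```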
Figure 2. The 3D humanoid robot model.


2.3. Reflecting Camera Effect
When deciding on a full-body pose, the position and viewing angle of the camera with respect to the subject are also crucial factors. The relative camera position determines the overall body size, which can be reflected in the humanoid model by multiplying all the link lengths shown in Figure 2 by the size factor γ. That is, as the camera moves away, the body size decreases, i.e., γ < 1, and vice versa, i.e., γ > 1.
In addition, the relative camera view angle makes the same standing pose look different, as shown in Figure 3. Figure 3a shows a case where the camera shoots a subject from top to bottom. Figure 3b shows a situation where the camera is rotated 90 degrees clockwise. In Figure 3c, the camera views a subject from the front left. These differences in pose due to the camera view angle can be mathematically described by the relationship between the body coordinate frame and the camera-based coordinate frame, which is shown in Figure 4. The sagittal body angle θbd, the coronal body angle φbd, and the transversal body angle ψbd correspond to the three pose changes in Figure 3, respectively. The polarity of these body angle parameters is determined to match the Vicon sign convention, such that θbd > 0 for forward tilting, φbd > 0 for left tilting, and ψbd > 0 for left rotation [24], and vice versa.

Figure 3. Identical poses looking different according to the camera view angles. (a) from top to bottom, (b) rotated 90 degrees clockwise, (c) from the front left.

Figure 4. Relationship between the body coordinate frame, xh yh zh, and the camera-based coordinate frame, xc yc zc.
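The following Python sketch illustrates one way this camera effect can be applied to the model: the three body angles rotate the humanoid's joint positions from the body frame into the camera frame, and the size factor γ scales the result before the depth axis is dropped for 2D comparison. The rotation order and axis assignments below are assumptions for illustration only; the paper defines the angle polarities, but this excerpt does not spell out the exact composition.

```python
import numpy as np

def rot_x(a):  # rotation about the x-axis (radians)
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):  # rotation about the y-axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):  # rotation about the z-axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def body_to_camera(joints_body, theta_bd, phi_bd, psi_bd, gamma):
    """Map 3D joint positions (N x 3, body frame, root at origin) into the
    camera frame and scale them by the size factor gamma.

    Assumed composition: transverse (psi), then coronal (phi), then sagittal
    (theta) rotation -- an illustrative choice, not the paper's definition."""
    R = rot_x(theta_bd) @ rot_y(phi_bd) @ rot_z(psi_bd)
    return gamma * (joints_body @ R.T)

def project_to_image_plane(joints_cam):
    """Drop the depth axis to obtain 2D coordinates for comparison with the
    MPP skeleton (orthographic approximation, keeping (y_c, z_c) as in
    Equation (7) later in the paper)."""
    return joints_cam[:, 1:3]
```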

2.4. Fast Global Optimization Method


For an estimation of the joint angles in the humanoid model that fits well with the MPP
skeleton model for the current frame, inverse kinematics is basically necessary. However, it
takes a long time to solve the formula-based inverse kinematics of the humanoid robot due
to the complicated structure of the humanoid model, as shown in Figure 2.
uDEAS has been developed for solving non-smooth and multimodal engineering
problems. uDEAS has validated the fastest and most reliable global optimization performance
on seven well-known low-dimensional (two to six) benchmark functions, three high-
dimensional (up to 30) benchmark functions [29], the optimal design of a Gabor filter [30],
and the joint trajectory generation for a humanoid ascending and descending stairs [31].
In addition, a modified version of uDEAS that can also search integer variables, cDEAS
(combinatorial DEAS), has recently been developed and applied to the optimization of a
hybrid energy system [32].
uDEAS is a global optimization method that hybridizes local and global search
schemes. For local search in uDEAS, all the optimization variables are represented by
binary strings, as in a genetic algorithm (GA) [33]. A basic unit of the local search is a
session composed of a single bisectional search (BSS) and multiple unidirectional searches
(UDS) with a binary string for each variable. The BSS attaches 0 and 1 at the end of the
selected string as a new least significant bit (LSB), e.g., 010₂ ← 01₂ → 011₂, where the
insertion of 0 (1) as a new LSB of the binary string corresponds to a decrease (increase) in
its decoded real value compared to that of the parent string [23]. For example, assume that
the binary string 010₂ is converted by the decoding function into a real value 0.3 and the
cost value is 0.7, i.e., J(d(010₂)) = J(0.3) = 0.7, whereas the binary string 011₂ is decoded
into 0.1 and its cost value is 0.3, i.e., J(d(011₂)) = J(0.1) = 0.3. Because J(0.1) < J(0.3),
increasing the current variable turns out to be a good search scheme as a result of the BSS.
On the other hand, the UDS adds 1 to or subtracts 1 from the optimal binary string according
to the promising direction found at the previous BSS. For the BSS example mentioned
above, the subsequent UDS will produce binary strings along the better BSS direction,
such that 100₂ (first UDS) → 101₂ (second UDS) → ··· until no more cost reduction occurs.
This set of BSS and UDS plays a significant role in balancing exploitation and exploration
in the local search, respectively.
For the n-dimensional problem, uDEAS stacks up the n strings to make up a binary
matrix with n rows.
• Step 1. Initialization of a new restart: Make an n × m binary matrix M whose elements are randomly chosen binary digits. The row length index m is set to m0. The optimization variable vector is v = [v1 v2 ··· vn]^T.
• Step 2. Start the first session with i = 1.
• Step 3. BSS: From the current best matrix M, the binary vector of the j(= J(i))-th row is selected as

r_j(M) = [a_jm a_j(m−1) ··· a_j1], a_jk ∈ {0, 1}, 1 ≤ k ≤ m    (1)

Attach 0 or 1 as a new LSB of the row vector, which yields

r_j^− = [a_jm a_j(m−1) ··· a_j1 0], r_j^+ = [a_jm a_j(m−1) ··· a_j1 1]    (2)

Then, these strings are decoded into real values and substituted for the jth variable of the current best optimization variable vector v* as follows:

v^− = v^+ = v*, v^−(j) = d(r_j^−), v^+(j) = d(r_j^+)    (3)

Next, compute the cost values J(v^−) and J(v^+). If J(v^−) < J(v^+), the direction for the UDS is set as u(j) = −1; otherwise, u(j) = 1. The better row is saved as r_j^*.
• Step 4. UDS: Depending on the direction u(j), perform addition or subtraction on the jth row, which is described as

r_j(M) = r_j^*(M) + u(j)    (4)

Check whether the new row r_j contributes to a further reduction of the loss function. If so, the current binary string and the variable are updated as the optimal ones as follows, and go to Step 4.

r_j^* = r_j, v*(j) = d(r_j^*)    (5)

Otherwise, go to Step 5.
• Step 5. Save the resultant UDS best string, r_j^*(M), into the jth row of the current best matrix.
• Step 6. If i < n, set i = i + 1 and go to Step 3. Otherwise, if the current string length m is shorter than the prescribed maximal row length m_f, set i = 1, increase the row length index as m = m + 1, and go to Step 2. In the case of m = m_f, go to Step 7.
• Step 7. If the number of restarts is less than the specified value, go to Step 1. Otherwise, terminate the current local search routine and choose the global minimum with the smallest cost value among the local minima found so far.
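A compact Python sketch of one BSS/UDS session over all variables is given below. It follows the steps above, but the decoding function d(·) (a binary fraction mapped into each variable's search range) and the surrounding restart logic are simplifying assumptions made for illustration, not details taken from the uDEAS papers.

```python
import numpy as np

def decode(bits, lo, hi):
    """Assumed decoding d(.): interpret bits as a binary fraction in [lo, hi]."""
    frac = sum(b / 2 ** (k + 1) for k, b in enumerate(bits))
    return lo + (hi - lo) * frac

def bss_uds_session(rows, bounds, cost, m_max):
    """One uDEAS-style session: for each variable, a bisectional search
    (append 0/1 as a new LSB, keep the better extension) followed by
    unidirectional searches (+/-1 steps on the string) while the cost improves."""
    best_v = np.array([decode(r, lo, hi) for r, (lo, hi) in zip(rows, bounds)])
    best_J = cost(best_v)
    for j, (lo, hi) in enumerate(bounds):
        if len(rows[j]) >= m_max:
            continue
        # --- BSS: evaluate both one-bit extensions of row j ---
        cand = []
        for bit in (0, 1):
            r = rows[j] + [bit]
            v = best_v.copy()
            v[j] = decode(r, lo, hi)
            cand.append((cost(v), r, v))
        direction = 1 if cand[1][0] < cand[0][0] else -1
        best_J, rows[j], best_v = min(cand, key=lambda t: t[0])
        # --- UDS: step the string's integer value in the promising direction ---
        while True:
            as_int = int("".join(map(str, rows[j])), 2) + direction
            if not 0 <= as_int < 2 ** len(rows[j]):
                break
            r = [int(b) for b in format(as_int, f"0{len(rows[j])}b")]
            v = best_v.copy()
            v[j] = decode(r, lo, hi)
            J = cost(v)
            if J >= best_J:
                break
            best_J, rows[j], best_v = J, r, v
    return rows, best_v, best_J

# Example: 2D sphere function, 3-bit initial strings, bounds [-1, 1].
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rows = [list(rng.integers(0, 2, 3)) for _ in range(2)]
    bounds = [(-1.0, 1.0)] * 2
    print(bss_uds_session(rows, bounds, lambda v: float(np.sum(v ** 2)), m_max=12))
```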
3. Proposed Pose Estimation Algorithm
Pose Estimation Process
Figure 5 shows the overall flow diagram of the proposed pose estimation system.

Figure 5. Flow diagram of the proposed pose estimation process.

• Step 1. Calibration of link length: Our system checks whether the human subject is a
new user or not because the subject’s bone length information is basically necessary
for the model-based pose estimation. If the present system has no link length data for
the current subject, the link length measurement process begins; the subject stands
with the arms stretched down, images are captured for at least 10 frames, and the
length of each bone link is calculated as the average distance between the coordinates
of the end joints of the bone at each frame.

• Step 2. Acquire images from an RGB camera with an image grabber module of SBC.
Although an Intel RealSense camera is used in the present system, commercial RGB
webcams are also available.
• Step 3. Execute MPP and obtain 2D pixel coordinates of the 17 landmarks for the
captured human body.
• Step 4. Execute uDEAS to seek for unknown pose-relevant variables, such as the cam-
era’s distance factor and viewing angles, and the intrabody joint angles by reducing
the loss function formulated with the L2 norm between the joint coordinates obtained
with MPP and those reprojected onto the corresponding 2D plane.
• Step 5. Plot the estimated poses in 2D or 3D depending on the application field.
• Step 6. If the current image frame is the last one or a termination condition is met, stop
the pose estimation process. Otherwise, go to Step 2.
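The per-frame loop described in Steps 2–6 can be summarized by the Python sketch below. Here `extract_2d_landmarks` refers to the MediaPipe sketch shown in Section 2.1, while `estimate_pose_udeas` and `plot_pose` are hypothetical placeholders standing in for the uDEAS optimization of the pose variables and the visualization step; only the control flow mirrors the description above.

```python
import cv2

def estimate_pose_udeas(landmarks_2d, link_lengths, prev_solution=None):
    """Placeholder for Step 4: minimize the loss over the 19 pose variables.
    A real implementation would call a uDEAS routine such as the one
    sketched in Section 2.4."""
    raise NotImplementedError

def plot_pose(pose_variables):
    """Placeholder for Step 5: draw the estimated pose in 2D or 3D."""
    raise NotImplementedError

def run_pose_estimation(camera_index=0, link_lengths=None):
    cap = cv2.VideoCapture(camera_index)       # Step 2: image acquisition
    prev = None
    try:
        while True:
            ok, frame = cap.read()
            if not ok:                          # Step 6: stop at the last frame
                break
            pts = extract_2d_landmarks(frame)   # Step 3: MPP landmark estimation
            if pts is None:
                continue
            prev = estimate_pose_udeas(pts, link_lengths, prev)  # Step 4
            plot_pose(prev)                     # Step 5
    finally:
        cap.release()
```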
For an estimation of arbitrary poses at each frame, the size factor, γ, and the three
body angle values related to the camera view angle, such as θbd , φbd , and ψbd mentioned
in Section 2.3, are added to the list of optimization variables. Therefore, a complete
optimization vector for pose estimation consisting of 19 variables is estimated as follows:
V = [γ, θbd, φbd, ψbd, θtr, φtr, ψtr, θhp^l, θkn^l, θhp^r, θkn^r, θsh^l, θel^l, θsh^r, θel^r, φhp^l, φhp^r, φsh^l, φsh^r]^T    (6)

The loss function to be minimized by uDEAS is designed to reflect three features:


mean per joint position error (MPJPE) between the 2D MPP skeleton model and the fitted
humanoid model; deviation of the 2D coordinates of the center of mass (CoM) from the
center of the supporting feet of the 3D humanoid’s pose; and penalty values concerning
the joint angles of the humanoid pose for consistency with the natural human pose.
The MPJPE is calculated by the mean distance between the true coordinates (for the simulation case) or the MPP pixel coordinates (for the experimental case), (xp^{i,j}, yp^{i,j}), and the coordinates of the 2D reprojected humanoid model onto the camera-based frame in Figure 4, (yc^{i,j}, zc^{i,j}), which is described as

MPJPE(v) = (1/12) Σ_{i,j} ||(xp^{i,j}, yp^{i,j}) − (yc^{i,j}, zc^{i,j})||₂, i = l, r, j = sh, el, wr, hp, kn, an    (7)
When the two poses overlap exactly, this value decreases to zero.
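Equation (7) amounts to averaging twelve 2D point-to-point distances; a minimal NumPy sketch, assuming both poses are supplied as dictionaries keyed by (side, joint), is shown below.

```python
import numpy as np

SIDES = ("l", "r")
JOINTS = ("sh", "el", "wr", "hp", "kn", "an")

def mpjpe_2d(mpp_pts, reprojected_pts):
    """Mean per joint position error over the 12 joints of Equation (7).
    Both arguments map (side, joint) -> (x, y) in the same 2D frame."""
    dists = [
        np.linalg.norm(np.subtract(mpp_pts[(i, j)], reprojected_pts[(i, j)]))
        for i in SIDES
        for j in JOINTS
    ]
    return float(np.mean(dists))
```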
Figure 6 shows three models: the MPP pixel model, the initial humanoid model, and
the tuned model, whose pelvis centers are moved to the origin of the coordinate frame.
uDEAS simultaneously adjusts the model contraction factor γ and the body and joint angles
so as to minimize Equation (7), the average deviation between the MPP pixel model and the
contracted humanoid model in terms of the distance between the joint coordinates in the
2D plane, which is illustrated with red arrows.
The humanoid's CoM deviation (CoMD) measured on the floor as the second pose metric is significant for measuring the standing stability of the current pose generated by uDEAS. Based on the camera-based coordinate frame xc yc zc defined in Figure 4, CoMD is described as follows:

CoMD(v) = ||(xc^CoM, yc^CoM) − ((xc^{l,ft}, yc^{l,ft}) + (xc^{r,ft}, yc^{r,ft}))/2||₂ / lleg    (8)

where (xc^CoM, yc^CoM) is the coordinate of the humanoid's CoM projected onto the floor; (xc^{i,ft}, yc^{i,ft}), i = l, r, denotes the center of the floor coordinates of the left and right feet; and lleg denotes the length of the leg attained by summing the three links in the leg, i.e., lleg = lfe + ltb + lca, which is necessary for normalization.

Figure 6. Pose comparison in the 2D frontal plane between the MPP pixel model (solid line) and the fitted model (dashed line) contracted from the initial humanoid model (dash-dotted line).

As the third metric of the loss function, penalty functions concerning the suitable bounds of some joint angles are newly proposed to make it possible to find the best fit among several identical 3D poses for an estimated MPP pose. Figure 7 shows three types of penalty functions for joint angles. The single-sided negative (positive) penalty function Psn (Psp) and the double-sided penalty function Pd are defined as follows:

Psn(θ, σn) = |θ − σn| if θ < σn; 0 if θ ≥ σn    (9)

Psp(θ, σp) = θ − σp if θ > σp; 0 if θ ≤ σp    (10)

Pd(θ, σn, σp) = |θ − σn| if θ < σn; θ − σp if θ ≥ σp; 0 if σn ≤ θ < σp    (11)

where σn and σp represent negative and positive threshold values for the joint angle θ, respectively. Psn (Psp) plays the role of informing uDEAS, by increasing the loss function, that an unrealistic pose is generated when a specific joint angle falls (rises) below (above) the assigned threshold angle. For example, when a human is standing, a sagittal torso angle θtr smaller than −20°, i.e., leaning back, is rare, and thus adding Psn(θtr, −20°) to the loss function may prevent the generation of an unnatural pose by uDEAS. On the other hand, Pd is a useful function for creating the most realistic pose when a certain joint angle should stay within the two angles σn and σp. For instance, a sitting pose generally has a coronal torso angle φtr between −10° and 10°, and thus adding Pd(φtr, −10°, 10°) to the loss function will help uDEAS select a feasible φtr. The total loss function is formulated in Section 4.

Figure 7. Penalty functions for suitable joint angles. (a) single-sided negative function, (b) single-sided positive function, and (c) double-sided function.

4. Experimental Setup and Results
For the proposed human pose estimation system, an Intel NUC 11th i7 (NUC11TNKi7) was used as the SBC for camera image acquisition and joint estimation, and an Intel RealSense is installed as a commercial RGB-D sensor. In the SBC, the CPU is a quad-core i7-1165G7 (4.7 GHz) and the memory is 16 GB DDR4. The HDD is an SSD NVMe 480 GB, and the OS is Ubuntu 18.04. For a high-mobility pose estimation capability, these modules are installed on a mobile robot base, as shown in Figure 8.

Figure 8. Human pose recognition system.

The basic configuration of uDEAS is determined as follows:
• Number of optimization variables: 19.
• Initial row length: 3.
• Maximum row length: 12.
• Number of maximum restarts: 20.
For pose estimation with a single image or an initial image of a video, the search
ranges of each variable need to be determined appropriately, i.e., not too wide for search
efficiency and not too narrow for inclusion of global minima. Table 1 summarizes the
upper and lower bounds of the optimization variables in which all the joint variables can
generate six ADL poses. In pose estimation with simulated poses where the ground-truth
joint angles are known, the size factor is set as 1.0. Among the joint variables, the knee
and elbow joints must have only positive angles, and, thus, their lower bounds are all set
as zero.
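A small sketch of how the Table 1 search ranges might be stored and narrowed per frame is given below; the ±10-degree windowing follows the description in the next paragraph, while the dictionary layout and function names are illustrative assumptions.

```python
# Initial search bounds (degrees) for a few representative variables from
# Table 1; non-negative joints such as the knees and elbows start at 0.
INITIAL_BOUNDS = {
    "theta_bd": (-10, 10),
    "theta_tr": (-20, 90),
    "theta_kn_l": (0, 90),
    "theta_kn_r": (0, 90),
    "theta_el_l": (0, 90),
    "theta_el_r": (0, 90),
}

NON_NEGATIVE = {"theta_kn_l", "theta_kn_r", "theta_el_l", "theta_el_r"}

def windowed_bounds(prev_optimum, margin=10.0):
    """Search ranges for the next frame: previous optimum +/- margin degrees,
    clamped so knee/elbow lower limits never fall below 0 degrees."""
    bounds = {}
    for name, value in prev_optimum.items():
        lo, hi = value - margin, value + margin
        if name in NON_NEGATIVE:
            lo = max(lo, 0.0)
        bounds[name] = (lo, hi)
    return bounds
```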
For pose estimation with a sequence of images, the search ranges are adjusted around
the optimal variables found by uDEAS in the previous image frame. In this paper, the upper
limit of the optimization variable was determined by adding 10 degrees to the optimal
variable of the previous frame, and the lower limit was set by subtracting 10 degrees from
this. In this case, when the minimum rotation angle must be 0 degrees, such as the elbow
γ θ bd φbd ψ bd θ tr φtr ψ tr θhpl θ knl θhpr θ knr θ shl θ ell θ shr θelr φhpl φhpr φshl φ
V 1 10 10 90 90 40 30 90 90 90 90 180 90 180 90 40 40 40
V Appl. Sci.
1 2023,
−10 −10
13, 2700 −90 −20 −40 −30 −20 0 −20 0 −180 0 −180 0 −40 −40
12 of 21 −40 −

For pose estimation with a sequence of images, the search ranges are adjusted aro
the optimal
or knee variables
joint, care must befound byensure
taken to uDEAS thatin
thethe previous
adjusted image
search’s frame.
lower In this
limit does not paper,
fall below
upper limit0 degrees.
of the optimization variable was determined by adding 10 degrees to the
timal variable of the previous frame, and the lower limit was set by subtracting 10 deg
Table 1. Upper and lower bounds of the optimization variables (in degrees).
from this. In this case, when the minimum rotation angle must be 0 degrees, such as
γ θbd φbd ψbd elbow
θtr ortr knee
φ ψtr joint,
θlhp care
θlkn must
θrhp be
θrkntaken
θlsh toθensure
l
el θrsh that
θrel the ladjusted
φhp r
φhp φsearch's
l
sh
r
φsh lower l
V 1 10 10 90
does
90
not
40
fall30
below90
0 degrees.
90 90 90 180 90 180 90 40 40 40 40
V 1 −10 −10 −90 −20 −40 −30 −20 0 −20 0 −180 0 −180 0 −40 −40 −40 −40
4.1. Pose Estimation with Simulation Data
ForEstimation
4.1. Pose the performance validation
with Simulation Data of the proposed approach, we generated six po
(1) stand and
For the raise arms
performance front; (2)
validation of stand and raise
the proposed arms we
approach, up;generated
(3) walksix
with the left le
poses;
front; (4) bend down and grab an object; (5) sit on a chair; and (6) kneel
(1) stand and raise arms front; (2) stand and raise arms up; (3) walk with the left leg in down.
front;Figure
(4) bend9down
shows that
and grabthe
anloss function
object; (5) sit onprofiles gradually
a chair; and (6) kneelminimize
down. during 20 rest
Figure 9 shows that the loss function profiles gradually minimize during
for a given pose. It is apparent that the loss functions converge when the row lengt 20 restarts
for auDEAS
the given pose. It is apparent
matrices that the loss functions converge when the row length of the
reach 10.
uDEAS matrices reach 10.
0
10

-1
10
current best cost

-2
10

-3
10
3 4 5 6 7 8 9 10 11 12
row length

Figure 9. Minimization aspect of loss function during 20 restarts of uDEAS with each restart
Figure 9. Minimization aspect of loss function during 20 restarts of uDEAS with each restart co
colored differently.
ored differently.
Figure 10 shows that uDEAS successfully estimates the six poses under the current
Figure configuration.
optimization 10 shows that uDEAS
Table successfully
2 lists the estimates
MPJPE of each the six in
global minimum poses under
Equation (7),the cur
optimization configuration. Table 2 lists the MPJPE of each global minimum in Equa
the average angular difference between the 18 ground-truth joint angles in Equation (6),
and the
(7), the estimated ones for the
average angular six poses.between
difference The average
theMPJPE is 0.097 m, and
18 ground-truth theangles
joint averagein Equa
angle difference per joint is 10.017 degrees, which is an acceptable result for pose estimation
with a full-scale humanoid model. The body size was selected referring to average Korean
women in their twenties [34]. It is worth noting that all poses were created with the body
transverse angles (ψbd ) between 20 and 40 degrees for generalization, which indicated that
the camera view angle was set to the side of the subject.

Table 2. MPJPE and average angular difference values for the best-fitted poses in Figure 10.

Pose 1 2 3 4 5 6 Avg.
MPJPE (m) 0.0055 0.0099 0.0111 0.0049 0.0150 0.0116 0.097
Avg. ang. diff (deg) 6.061 7.748 10.557 5.6 14.558 15.58 10.017
(6), and the estimated ones for the six poses. The average MPJPE is 0.097 m, and the aver-
age angle difference per joint is 10.017 degrees, which is an acceptable result for pose es-
timation with a full-scale humanoid model. The body size was selected referring to aver-
age Korean women in their twenties [34]. It is worth noting that all poses were created
with the body transverse angles (ψ bd ) between 20 and 40 degrees for generalization, which
Appl. Sci. 2023, 13, 2700 13 of 21
indicated that the camera view angle was set to the side of the subject.

Figure
Figure10.10.Pose
Poseestimation
estimation results
results for
for the six representative poses
poses generated
generatedby bythe
thehumanoid
humanoidmodel
model (solid line: true pose, dotted line: estimated pose, red line: right parts, blue line: left parts).
(solid line: true pose, dotted line: estimated pose, red line: right parts, blue line: left parts).

4.2. Pose Estimation with Experiment


For the performance validation of the proposed pose estimation method, several
activities in standing, sitting, and lying poses were filmed with an RGB camera and
analyzed by our system.
Figure 11 shows three poses captured and analyzed while standing and squatting.
Figure 11a,b show the 2D poses estimated by MPP and the reprojected poses attained by
uDEAS, and Figure 11c shows the reconstructed 3D poses.
+ γ sag _ sh _ r Psn (θ shr , −10° ) + γ cor _ sh _ l Psp (φshl , 20° ) + γ cor _ sh _ r Psp (φshl , 20° )

where the weights γ CoM and γ cor _ sh _ r denote the effects of each term on the loss function
to reflect the adequacy of the reconstructed 3D pose. Because the MPJPE was in the order
Appl. Sci. 2023, 13, 2700 −3 14 of 21
of 10 , these weights needed to be 0.01. The threshold values of the penalty functions
can be selected appropriately for a target pose and activity.

Appl. Sci. 2023, 13, x FOR PEER REVIEW 15 of 21

(a) (b) (c)


Figure 11. Pose
Figure 11. Pose estimation
estimation results
results for
for three
three poses
poses generated
generated by
by the
the humanoid
humanoidmodel.
model.(a)
(a)Original
Original
images
images and
and MPP
MPP results
results overlaid,
overlaid, (b)
(b) comparison
comparisonbetween
betweenthe
theMPP
MPPmodel
model(solid
(solidline)
line)and
andthe
thefitted
fitted
humanoid model (dotted line) in 2D plane, and (c) the humanoid model reconstructed in 3D plane
humanoid model (dotted line) in 2D plane, and (c) the humanoid model reconstructed in 3D plane
(red line: right parts, blue line: left parts).
(red line: right parts, blue line: left parts).

Figure
The loss12function
shows the trajectories
needs to reflectof the
the estimated
pose match, joint angles
pose for thepose
stability, poses from Figure
symmetry, and
11a–c. It is noteworthy that the sagittal hip and knee joints move
penalty values for the joint angles of the torso and both shoulders as follows: from 0 to around 100
degrees, and the sagittal torso angles change from 0 to 60 degrees, which match the actual
human joint angles rather well. These angle  trajectories also provide  information on the
current
L(v) =stance,
MJCDmaking
(v) + γthem
CoM helpful
CoMD ( v ) for
+ γ medical
sym θ l −
hp or
θ rtherapeutic
hp + θ kn kn + γsag_tr Psn (θtr , −10◦ )
l − θ rpurposes.
 
+γcor_tr Pd (φtr , −10◦ , 10◦ ) + γtran_tr Pd (ψtr , −15◦ , 15◦ ) + γsag_sh_l Psn θshl , −10◦
   
r , −10◦ + γ l ◦ +γ l ◦

+γsag_sh_r Psn θsh cor_sh_l Psp φsh , 20 cor_sh_r Psp φsh , 20

where the weights γCoM and γcor_sh_r denote the effects of each term on the loss function to
reflect the adequacy of the reconstructed 3D pose. Because the MPJPE was in the order of
10−3 , these weights needed to be 0.01. The threshold values of the penalty functions can be
selected appropriately for a target pose and activity.
(a) (b) (c)
Figure 11. Pose estimation results for three poses generated by the humanoid model. (a) Original
images and MPP results overlaid, (b) comparison between the MPP model (solid line) and the fitted
Appl. Sci. 2023, 13, 2700 humanoid model (dotted line) in 2D plane, and (c) the humanoid model reconstructed in 3D15 plane
of 21
(red line: right parts, blue line: left parts).

Figure 12 shows the trajectories of the estimated joint angles for the poses from Figure
Figure 12 shows the trajectories of the estimated joint angles for the poses from
11a–c. It is noteworthy that the sagittal hip and knee joints move from 0 to around 100
Figure 11a–c. It is noteworthy that the sagittal hip and knee joints move from 0 to around
degrees, and the sagittal torso angles change from 0 to 60 degrees, which match the actual
100 degrees, and the sagittal torso angles change from 0 to 60 degrees, which match the
human joint angles rather well. These angle trajectories also provide information on the
actual human joint angles rather well. These angle trajectories also provide information on
current stance, making them helpful for medical or therapeutic purposes.
the current stance, making them helpful for medical or therapeutic purposes.

(a) (b)
Figure
Figure 12.
12. Trajectories
Trajectories of the estimated
estimated joint
joint angles
angles(degree)
(degree)atat(a)
(a)the
thetorso
torsoand
and(b)
(b)the
thesagittal
sagittalplane.
plane.
For real-time application of the proposed pose optimization system, the execution time
needsFor to be measured
real-time while running
application uDEAS
of the on NUC.
proposed pose Table 3 lists the mean
optimization run the
system, timeexecution
per frame
measured
time needswhile changing the
to be measured whilenumber
runningof restarts
uDEASand the maximum
on NUC. rowthe
Table 3 lists length
meanofrun uDEAS.
time
Interestingly,
per frame measured the optimal
whilerow length the
changing is found
numberto beof six whenand
restarts number of restartsrow
the maximum is set at 10
length
because the loss of the next row length (7.22 × 10 −3 ) increases significantly. In the same manner,
of uDEAS. Interestingly, the optimal row length is found to be six when number of restarts
the
is setoptimal numberthe
at 10 because of restarts is 6 next
loss of the whenrow the length
optimal( 7.22 10 −3 ) increases
row×length is 12. As asignificantly.
combinationIn of
these results, the loss 7.04 ×the − 3
the same manner, theand meannumber
optimal run timeofper frameisare
restarts 6 when 10 optimal
and 0.033
rows,length
respectively,
is 12.
in the case when the number of restarts and the maximum row length are both six. Since −3
this
As a combination of these results, the loss and mean run time per frame are 7.04 × 10
optimization
and execution timeinisthe
0.033 s, respectively, belowcase100 ms, i.e.,
when thecamera-capturing
number of restarts time forthe
and eachmaximum
frame at 10row fps,
it is likely that real-time pose estimation is possible with the proposed
length are both six. Since this optimization execution time is below 100ms, i.e., camera- system.
capturing time for each frame at 10fps, it is likely that real-time pose estimation is possible
Table 3. Comparison of uDEAS run time measured in NUC (bold: optimal configuration).
with the proposed system.
Figure 13 shows Max.
No. Restart various
Row 2D poses and Loss
Length the (corresponding
×10−3 ) 3DRun
Avg. poses
Timereconstructed
Per Frame (s)
from the 2D poses using the camera 12
images and MPP.
6.52
It can be seen that the reconstructed
0.180
11 6.63 0.165
10 6.63 0.137
9 6.54 0.118
10 8 6.74 0.096
7 6.72 0.078
6 6.94 0.062
5 7.22 0.044
4 12.89 0.028
9 6.65 0.170
8 6.73 0.149
7 6.68 0.130
12
6 6.98 0.113
5 7.15 0.096
4 7.98 0.079
6 6 7.04 0.033
7 6.72 0.078
6 6.94 0.062
5 7.22 0.044
4 12.89 0.028
9 6.65 0.170
Appl. Sci. 2023, 13, 2700 16 of 21
8 6.73 0.149
7 6.68 0.130
12
6 6.98 0.113
Figure 13 shows various 2D poses and the corresponding 3D poses reconstructed
5 7.15 0.096
from the 2D poses using the camera images and MPP. It can be seen that the reconstructed
4 7.98 0.079
humanoid models are similar to the actual 3D poses owing to the loss function that reflects
6 6
many conditions related to the stable poses. 7.04 0.033

Appl. Sci. 2023, 13, x FOR PEER REVIEW 17 of 21

Figure 13. Cont.


Appl. Sci. 2023, 13, 2700 17 of 21

Figure 13. Two-dimensional


Figure13. Two-dimensional poses and the
poses and the corresponding
corresponding3D
3Dposes
posesreconstructed
reconstructedbyby the
the proposed
proposed
Figure 13. Two-dimensional poses and the corresponding 3D poses reconstructed by the proposed
approach
approachviewed
viewed from
from various
various angles (red line:
line: right
rightparts,
parts,blue
blueline:
line:left
leftparts).
parts).
approach viewed from various angles (red line: right parts, blue line: left parts).

To
Toverify
verify the
the activity
activity estimation
estimation performance,aasudden
performance, suddenfallfallmotion
motion estimation was
To verify the activity estimation performance, a sudden fall motion estimationestimation
was was
attempted,
attempted, as
as shown
shown in
in Figures
Figures 14
14 and
and 15.
15. The
The images
images captured
captured
attempted, as shown in Figures 14 and 15. The images captured in Figure 14 and the poses
in
in Figure
Figure 1414 and
and the
the poses
poses
of the
thehumanoid
ofhumanoid
of the humanoidmodelmodel
model in Figure
in Figure 15 match 15well,
15 match
matchand well,
well, andthe
and theestimated
the estimated estimated jointangle
joint anglejoint angletrajectories
trajectories trajectories
plotted
plotted
plotted in Figure
in Figure
in Figure 16 have16 have apparently
16apparently
have apparently
abnormal abnormal
abnormal features.
features.features. Specifically,
Specifically,
Specifically, the fact that the
thethe fact
fact that
hipthat thethe hip
hip
and knee
and knee
and knee joint angles
joint anglesangles in the sagittal
in the sagittal
in the sagittal plane change
plane suddenly
plane change suddenly
change suddenly from 0
from(standing)
from 0 degree degree
0 degree (standing)
to 50(standing)
to 50to
50 (hip)
(hip)(hip)
and 120 and
and(knee)120 (knee)
degrees
120 (knee) degrees means
means means
degrees that
that thethat
subject the subject
the issubject
instantly is instantly
folding their
is instantly folding their
righttheir
folding right leg. AsAs
leg. As right leg.
such, other
such,
such,otherposes can
other poses be recognized based
poses can be recognized on
recognized basedthe
basedonrepresentative
onthe joint
therepresentative angle
representativejoint profiles
jointangle using
angleprofiles
profiles using
using
the proposed
the
theproposed approach.
proposed approach.
approach.

Appl. Sci. 2023, 13, x FOR PEER REVIEW 18 of 21


Appl. Sci. 2023, 13, x FOR PEER REVIEW 18 of 21

Figure 14. Results of 2D pose estimation obtained by MPP for a sequence of images in a sudden falling case (red line: right parts, blue line: left parts).
Figure 15. Results of the reconstructed 3D humanoid poses corresponding to Figure 14 (red line: right parts, blue line: left parts).

Figure 16. Angular trajectories (in degrees) of the shoulder, elbow, hip, and knee joints in the sagittal plane.
Actually, MPP estimates the depth data for each landmark as well [22]. As a validation of the depth estimation performance of MPP, Figure 17 compares two 3D standing poses estimated by the proposed method and MPP. As shown in the figure, the depth data of each joint estimated by the current version of MPP have large errors when compared with our results. The ratio of the 2D MPJPE and the 3D MPJPE is 185.37 for the standing pose and 277.81 for the arm-raised pose. The latter pose is more twisted than the former from the camera's viewpoint, and thus the depth error leads to a larger pose discrepancy. This result is why we have employed only 2D landmark information for recovering the 3D human pose in the present work. Even if the depth value is estimated almost exactly in a later version of MPP, our method can be applied as it is, and the 3D estimation accuracy is expected to be higher.

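For reference, a minimal sketch of the MPJPE metric used in this comparison is given below; the helper name and the pairing of poses in the usage comment are our reading of the text, not code from the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: the average Euclidean distance between
    corresponding joints. pred and gt are (N_joints, D) arrays with D = 2 or 3."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

# hypothetical usage for the comparison discussed above:
# ratio = mpjpe(mpp_pose_3d, ours_pose_3d) / mpjpe(mpp_pose_2d, ours_pose_2d)
```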
Figure 17. Comparison of 3D poses obtained from MPP (solid) and the proposed approach (dotted). (a) standing pose, (b) arm-raised pose (red line: right parts, blue line: left parts).

5. Conclusions
In this paper, we present a 3D human pose estimation system for monocular images and videos that takes 2D skeletal poses estimated by the off-the-shelf deep learning method, MPP, as the input and fits the 3D humanoid robot model to the 2D pose at the joint angle level through reprojection, using the fast optimization method, uDEAS. Most recent pose estimation methods are built on deep neural networks and, thus, require high-performance PCs or SBCs with multiple GPUs, which limits their application to mobile robot systems because of rapid heating and the procurement difficulties caused by semiconductor supply shortages. In order to improve the pose estimation performance, we elaborated our full-body humanoid robot model by adding three joints at the root joint, and we added a CoM deviation term to the loss function and penalty functions that constrain the joint angle ranges for pose balance. Adopting the CoM concept is a novel idea in the area of pose estimation. With these efforts, the optimization execution time per frame is measured at 0.033 s on a NUC without a GPU, showing the feasibility of a real-time system.
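As a schematic illustration of how such a per-frame cost could be assembled, a minimal sketch is given below. The weights, argument names, and exact form of each term are assumptions made for this example, not the published formulation; uDEAS is treated here as a black-box minimizer of this function.

```python
import numpy as np

def pose_loss(theta, landmarks_2d, forward_kinematics, project,
              com_of, support_center, joint_limits,
              w_reproj=1.0, w_com=1.0, w_pen=10.0):
    """Sketch of a per-frame fitness combining 2D reprojection error,
    CoM deviation from the supporting feet, and joint-angle range penalties.

    theta              : candidate joint-angle vector.
    landmarks_2d       : (N, 2) MPP landmark coordinates.
    forward_kinematics : theta -> (N, 3) model joint positions.
    project            : (N, 3) -> (N, 2) reprojection onto the image plane.
    com_of             : (N, 3) joint positions -> (3,) center of mass.
    support_center     : (2,) ground-plane center of the supporting feet.
    joint_limits       : list of (lo, hi) bounds, one pair per joint angle.
    """
    joints_3d = forward_kinematics(theta)

    # 1) 2D reprojection error against the MPP landmarks
    reproj = np.mean(np.linalg.norm(project(joints_3d) - landmarks_2d, axis=1))

    # 2) CoM deviation from the center of the supporting feet (ground plane)
    com_dev = np.linalg.norm(com_of(joints_3d)[:2] - support_center)

    # 3) penalty for angles outside their admissible ranges
    pen = sum(max(0.0, lo - q) + max(0.0, q - hi)
              for q, (lo, hi) in zip(theta, joint_limits))

    return w_reproj * reproj + w_com * com_dev + w_pen * pen
```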
To validate the proposed approach, we generated 3D simulation data for six ADL poses
and compared them with the poses estimated by uDEAS. The mean MPJPE was 0.097 m,
and the average angle difference per joint was 10.017 degrees, which is an acceptable result
for pose estimation. The execution time of uDEAS was measured as 0.033 s in the case
when the number of restarts and the maximum row length were both six, which was below
the camera-capturing time of 0.1 s (10 fps); thus, it is likely that real-time pose estimation is
possible with the proposed system. In the experiment with the proposed system, a standing
to squatting activity, several whole-body exercises, and a dangerous activity of falling
were captured on video, and each frame was input into the proposed system. The results
show that very fast and drastic changes occur in the angular trajectories of the shoulder,
elbow, hip, and knee joints, providing a lot of information for activity recognition. In future
work, the proposed pose estimation system may be applied to analyze the activities of
construction workers and to monitor patients with Parkinson’s disease to build a database
of joint angles for human motions in target areas. It is expected that timely awareness of
abnormal or dangerous activities will be possible based on direct joint angle information.
In addition, the present approach, which requires neither a deep learning model nor a training dataset, can complement deep learning-based methods in analyzing and recognizing arbitrary ADL poses.

Author Contributions: Software, J.-W.K., J.-Y.C. and E.-J.H.; Data curation, E.-J.H.; Writing—original
draft, J.-W.K.; Funding acquisition, J.-H.C. All authors have read and agreed to the published version
of the manuscript.
Funding: This research was funded by the National Research Foundation of Korea (NRF) with a
grant funded by the Korea government (MSIT) (No. NRF-2021R1A4A1022059) and by the NRF grant
funded by the MSIT (2020R1A2C1014649).
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: No new data were created or analyzed in this study. Data sharing is
not applicable to this article.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Su, M.; Hayati, D.W.; Tseng, S.; Chen, J.; Wei, H. Smart Care Using a DNN-Based Approach for Activities of Daily Living (ADL)
Recognition. Appl. Sci. 2020, 11, 10. [CrossRef]
2. Noreils, F.R. Inverse kinematics for a Humanoid Robot: A mix between closed form and geometric solutions. Tech. Rep. 2017, 1–31.
[CrossRef]
3. Yu, Y.; Yang, X.; Li, H.; Luo, X.; Guo, H.; Fang, Q. Joint-level vision-based ergonomic assessment tool for construction workers.
J. Constr. Eng. Manag. 2019, 145, 04019025. [CrossRef]
4. Rokbani, N.; Casals, A.; Alimi, A.M. IK-FA, a new heuristic inverse kinematics solver using firefly algorithm. Comput. Intell. Appl.
Model. Control 2015, 369–395. [CrossRef]
5. Xu, J.; Yu, Z.; Ni, B.; Yang, J.; Yang, X.; Zhang, W. Deep kinematics analysis for monocular 3d human pose estimation. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 899–908.
6. Li, J.; Xu, C.; Chen, Z.; Bian, S.; Yang, L.; Lu, C. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose
and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN,
USA, 19–25 June 2021; pp. 3383–3393.
7. Sarafianos, N.; Boteanu, B.; Ionescu, B.; Kakadiaris, I.A. 3D human pose estimation: A review of the literature and analysis of
covariates. Comput. Vis. Image Underst. 2016, 152, 1–20. [CrossRef]
8. Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image
Underst. 2020, 192, 102897. [CrossRef]
9. Wang, J.; Tan, S.; Zhen, X.; Xu, S.; Zheng, F.; He, Z.; Shao, L. Deep 3D human pose estimation: A review. Comput. Vis. Image
Underst. 2021, 210, 103225. [CrossRef]
10. Yurtsever, M.M.E.; Eken, S. BabyPose: Real-time decoding of baby’s non-verbal communication using 2D video-based pose
estimation. IEEE Sens. 2022, 22, 13776–13784. [CrossRef]
11. Alam, E.; Sufian, A.; Dutta, P.; Leo, M. Vision-based human fall detection systems using deep learning: A review. Comput. Biol.
Med. 2022, 146, 105626. [CrossRef]
12. Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Coarse-to-fine volumetric prediction for single-image 3D human pose.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017;
pp. 7025–7034.
13. Luvizon, D.C.; Picard, D.; Tabia, H. 2d/3d pose estimation and action recognition using multitask deep learning. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5137–5146.
14. Li, S.; Chan, A.B. 3d human pose estimation from monocular images with deep convolutional neural network. In Proceedings of
the Asian Conference on Computer Vision, Singapore, 1–5 November 2014; pp. 332–347.
15. Zhou, X.; Sun, X.; Zhang, W.; Liang, S.; Wei, Y. Deep kinematic pose regression. In Proceedings of the European Conference on
Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 186–201.
16. Tome, D.; Russell, C.; Agapito, L. Lifting from the deep: Convolutional 3d pose estimation from a single image. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2500–2509.
17. Wang, J.; Huang, S.; Wang, X.; Tao, D. Not all parts are created equal: 3D pose estimation by modelling bi-directional dependencies
of body parts. arXiv 2019, arXiv:1905.07862.
18. Wandt, B.; Rosenhahn, B. Repnet: Weakly supervised training of an adversarial reprojection network for 3d human pose
estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA,
16–20 June 2019; pp. 7782–7791.
19. Sigal, L.; Balan, A.O.; Black, M.J. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for
evaluation of articulated human motion. IJCV 2010, 87, 4–27. [CrossRef]
20. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6m: Large scale datasets and predictive methods for 3d human
sensing in natural environments. TPAMI 2014, 36, 1325–1339. [CrossRef] [PubMed]
21. Pavllo, D.; Feichtenhofer, C.; Grangier, D.; Auli, M. 3D human pose estimation in video with temporal convolutions and semi-
supervised training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA,
16–20 June 2019; pp. 7753–7762.
22. MediaPipe Pose. Available online: https://google.github.io/mediapipe/solutions/pose.html (accessed on 28 December 2021).
23. Kim, J.-W.; Kim, T.; Park, Y.; Kim, S.W. On load motor parameter identification using univariate dynamic encoding algorithm for
searches (uDEAS). IEEE Trans. Energy Convers. 2008, 23, 804–813.
24. Vicon. Available online: https://www.vicon.com/ (accessed on 1 August 2021).
25. Vakanski, A.; Jun, H.P.; Paul, D.; Baker, R. A data set of human body movements for physical rehabilitation exercises. Data 2018,
3, 2. [CrossRef] [PubMed]
26. Bazarevsky, V.; Grishchenko, I. On-Device, Real-Time Body Pose Tracking with MediaPipe BlazePose, Google Research. Available
online: https://ai.googleblog.com/2020/08/on-device-real-time-body-pose-tracking.html (accessed on 10 August 2021).
27. Denavit, J.; Hartenberg, R.S. A kinematic notation for lower-pair mechanisms based on matrices. J. Appl. Mech. 1955, 77, 215–221.
[CrossRef]
28. Kim, J.-W.; Tran, T.T.; Dang, C.V.; Kang, B. Motion and walking stabilization of humanoids using sensory reflex control. Int. J.
Adv. Robot. Syst. 2016, 13, 1–10.
29. Kim, J.-W.; Kim, T.; Choi, J.-Y.; Kim, S.W. On the global convergence of univariate dynamic encoding algorithm for searches
(uDEAS). Int. J. Control Autom. Syst. 2008, 6, 571–582.
30. Yun, J.P.; Choi, S.; Kim, J.-W.; Kim, S.W. Automatic detection of cracks in raw steel block using Gabor filter optimized by univariate
dynamic encoding algorithm for searches (uDEAS). NDT E Int. 2009, 42, 389–397. [CrossRef]
31. Kim, E.; Kim, M.; Kim, S.-W.; Kim, J.-W. Trajectory generation schemes for bipedal ascending and descending stairs using
univariate dynamic encoding algorithm for searches (uDEAS). Int. J. Control Autom. Syst. 2010, 8, 1061–1071. [CrossRef]
32. Kim, J.-W.; Ahn, H.; Seo, H.C.; Lee, S.C. Optimization of Solar/Fuel Cell Hybrid Energy System Using the Combinatorial Dynamic
Encoding Algorithm for Searches (cDEAS). Energies 2022, 15, 2779. [CrossRef]
33. Goldberg, D.E. Genetic Algorithm in Search, Optimization and Machine Learning; Addison Wesley: Berkeley, CA, USA, 1999.
34. Size Korea. Available online: https://sizekorea.kr (accessed on 15 March 2022).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
