Article
Accurate Robot Arm Attitude Estimation Based on Multi-View
Images and Super-Resolution Keypoint Detection Networks
Ling Zhou, Ruilin Wang and Liyan Zhang *
College of Mechanical & Electrical Engineering, Nanjing University of Aeronautics and Astronautics,
Nanjing 210016, China; [email protected] (L.Z.); [email protected] (R.W.)
* Correspondence: [email protected]
Abstract: Robot arm monitoring is often required in intelligent industrial scenarios. A two-stage
method for robot arm attitude estimation based on multi-view images is proposed. In the first stage,
a super-resolution keypoint detection network (SRKDNet) is proposed. The SRKDNet incorporates
a subpixel convolution module in the backbone neural network, which can output high-resolution
heatmaps for keypoint detection without significantly increasing the computational resource con-
sumption. Efficient virtual and real sampling and SRKDNet training methods are put forward. The
SRKDNet is trained with generated virtual data and fine-tuned with real sample data. This method
decreases the time and manpower consumed in collecting data in real scenarios and achieves a better
generalization effect on real data. A coarse-to-fine dual-SRKDNet detection mechanism is proposed
and verified. Full-view and close-up dual SRKDNets are executed to first detect the keypoints and
then refine the results. The keypoint detection accuracy, [email protected], for the real robot arm reaches
up to 96.07%. In the second stage, an equation system, involving the camera imaging model, the
robot arm kinematic model and keypoints with different confidence values, is established to solve the
unknown rotation angles of the joints. The proposed confidence-based keypoint screening scheme
makes full use of the information redundancy of multi-view images to ensure attitude estimation
accuracy. Experiments on a real UR10 robot arm under three views demonstrate that the average
estimation error of the joint angles is 0.53 degrees, which is superior to that achieved with the
comparison methods.
Keywords: robot arm; attitude estimation; super-resolution keypoint detection network (SRKDNet); multi-view images
1. Introduction
In the context of intelligent manufacturing, robot arms with multiple joints play increasingly important roles in various industrial fields [1,2]. For instance, robot arms are utilized to accomplish automatic drilling, riveting and milling tasks in aerospace manufacturing; in automobile and traditional machinery manufacturing fields, robot arms can be frequently seen in automatic loading/unloading, automatic measurement and other production or assembly tasks.
In most industrial applications, a robot arm works in accordance with the pre-planned program. However, on occasions where the robot arm becomes out of control by mistake, serious collision or injury accidents may occur, especially in the work context of human–machine cooperation. Therefore, it is critical to configure monitoring means to ensure safety. On-site attitude monitoring of working robot arms is also essential for the collaborative work of multiple robot arms.

Machine vision is one of the most suitable and widely used monitoring means due to its relatively low cost, high applicability and good accuracy. To reduce the difficulty of image feature recognition and to improve monitoring accuracy and reliability, a common method in industry is to arrange cooperative visual targets on the monitored object [3,4].
to improve the resolution of the output keypoint heatmaps and in turn improve the keypoint positioning accuracy by introducing the idea of super-resolution image reconstruction into the keypoint detection network.
To monitor the attitude of a robot arm, it is essential to solve the rotation angle of
each joint. Based on the depth image of the robot arm, Widmaier et al. [16] used a random
forest regression model to estimate the robot arm attitude. Labbe et al. [17] and Zuo
et al. [18] estimated the robot arm attitude based on one single grayscale image. However,
serious joint occlusion is inevitable in one single-perspective image, which makes it hard
to detect some keypoints and may even lead to wrong estimation results. Moreover, the
depth ambiguity problem in monocular vision may lead to multiple solutions in attitude
estimation, reducing the monitoring reliability of the robot arm.
In this paper, we present a two-stage high-precision attitude estimation method for
base-fixed six-joint robot arms based on multi-view images. The contributions include the
following: (1) A new super-resolution keypoint detection network (SRKDNet for short)
is proposed. The novelty of the SRKDNet lies in that a subpixel convolution module is
incorporated in the backbone neural network HRNet [11] to learn the law of resolution
recovery of the downsampled feature maps. This method can alleviate the disadvantages of
low-resolution heatmaps and improve the keypoint detection accuracy without significantly
increasing the computing resource consumption. (2) A coarse-to-fine detection mechanism
based on dual SRKDNets is put forward. A full-view SRKDNet obtains a relatively rough
keypoint detection result. Then, a close-up SRKDNet is executed to refine the results with
a cropped image of the ROI determined by the results of the full-view SRKDNet. The
dual-SRKDNet detection mechanism performs better than one-time detection, and the
keypoint detection accuracy is drastically improved. (3) Efficient virtual-and-real sampling
and neural network training methods are proposed and verified. The virtual sample data
are first used to train the neural network, and then a small number of real data are applied
to fine-tune the model. This method achieves accurate keypoint detection for real data
without consuming a huge amount of time and manpower. (4) The constraint equations for
solving the rotation angles of each joint are established; they depict the relation among the
detected keypoints in the multi-view images, the camera imaging model and the kinematic
model of the robot arm. A screening strategy based on the keypoint detection confidence
is incorporated in the solving process and is proved to be critical for ensuring attitude
estimation accuracy. Experiments demonstrate that the whole set of methods proposed in
this paper can realize high-accuracy estimation of robotic arm attitude without utilizing
cooperative visual markers.
The remaining contents of this paper are arranged as follows: In Section 2, we in-
troduce the whole set of methods, including the approaches to high-precision keypoint
detection (Section 2.1), automatic virtual sample generation (Section 2.2) and robot arm
attitude estimation (Section 2.3). Experiments on virtual and real robot arms are reported
in Section 3. We conclude the paper in Section 4.
Figure 1. The selected keypoints on/in the robot arm.
Each keypoint was directly selected on the 3D digital model of the robot arm when it was in the zero position. To generate the sample data for neural network training, either a virtual or real robot arm was controlled to move to the specified positions. The 3D coordinates of the preset keypoints at each specified position (for instance, the position as in Figure 1) could be obtained according to the kinematics of robotic arms [19], which will be detailed in Section 2.2.2.
Given any predefined keypoint position in/on the digital model, its corresponding image point can be calculated according to the camera model. In this way, we obtained a large number of training samples for the keypoint detection network. Experiments show that with the predefined keypoints, the keypoint detection network works well and the arm attitude estimation achieves high accuracy.

This section begins with a brief introduction to the backbone network HRNet [11] used in this paper. Then, we introduce the idea of image super-resolution reconstruction into keypoint detection and propose a super-resolution keypoint detection network, SRKDNet, which can alleviate the disadvantages of the low-resolution heatmaps without significantly increasing the computing resource consumption. A coarse-to-fine keypoint detection scheme based on dual SRKDNets is also presented in this section.
2.1.1. Brief Introduction to HRNet
To retain high resolution for the feature maps in the forward propagation, the high-resolution network HRNet processes feature maps at various resolutions, as shown in Figure 2. First, a preprocessing module is used to downsample the input image, which lowers the resolution of the output heatmaps as well. The main structure of HRNet can be divided into several stages, and the branches at different resolutions in each stage use residual blocks to extract features. After each stage, a new branch is created without abandoning the original-resolution branch. The new branch is obtained by strided convolutions. The length and width of the new feature map are reduced to 1/2 of the original, but the number of channels becomes twice that of the original. In the new stage, the feature maps are created by fusing the multi-scale feature maps of each branch in the previous stage. The HRNet shown in Figure 2 has four branches with different resolutions. The final output feature map integrates the information extracted from the four branches and is used for generating the keypoint heatmap.

Figure 2. Network structure of HRNet.

The multi-scale fusion operation in HRNet is shown in Figure 3. The feature maps from the branches with the same resolution remain the same, while those with different resolutions are converted to the same resolution first via upsampling or downsampling (strided convolutions). Then, they are aggregated to obtain the output maps. HRNet has powerful multi-scale image feature extraction capability and has been widely used in classification recognition, semantic segmentation and object detection. We take HRNet as the backbone in our super-resolution keypoint detection network (SRKDNet), which will be presented in the next subsection.

Figure 4. Structure of the proposed SRKDNet.
Adding a branch at the input side can provide additional shallow image features for the final heatmap generation. We believe that the combination of shallow and deep feature information is conducive to keypoint detection. In order to avoid the loss of image information, the branch consists of only one convolution layer and one batch normalization layer, with no activation layer involved.

The essence of subpixel convolution is to rearrange the pixels from a low-resolution image with more channels in a specific way to form a high-resolution image with fewer channels. As shown in Figure 5, pixels of different channels at the same position on the low-resolution image with size $(r^2, H, W)$ are extracted and composed into small squares with size $(r, r)$, which together form a high-resolution image with size $(1, rH, rW)$. Realizing the subpixel convolution requires convoluting the low-resolution feature maps first to expand the number of image channels. For instance, the low-resolution image in size of $(C, H, W)$ needs to be expanded to size $(r^2 C, H, W)$ via convolution before it can be converted to a high-resolution image in size of $(C, rH, rW)$. In our implementation, $r = 2$.

Figure 5. Subpixel convolution processing.

Since the backbone network HRNet has a strong ability to detect features in multiple scales, SRKDNet does not adopt a complex super-resolution neural network structure. Instead, only a subpixel convolution module is applied to enable the convolutional neural network to learn the knowledge for generating high-quality and high-resolution heatmaps from low-resolution information. The subsequent experiments can prove its significant effects in improving the neural network performance.
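A minimal PyTorch sketch of the two ideas above, assuming a monochrome input image, a 32-channel backbone feature map and 16 keypoints; the module names, the channel counts and the point at which the shallow features are fused are illustrative assumptions, not the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRKDHead(nn.Module):
    """Illustrative head: a shallow branch (one conv + BN, no activation) fused with
    backbone features, then channel expansion and PixelShuffle for super-resolution."""
    def __init__(self, backbone_ch=32, num_keypoints=16, r=2, shallow_ch=8):
        super().__init__()
        # Shallow branch on the raw (monochrome) input image.
        self.shallow = nn.Sequential(
            nn.Conv2d(1, shallow_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(shallow_ch),
        )
        # Expand channels so that PixelShuffle(r) produces num_keypoints heatmaps.
        self.expand = nn.Conv2d(backbone_ch + shallow_ch, num_keypoints * r * r,
                                kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)  # (N, K*r^2, H, W) -> (N, K, r*H, r*W)

    def forward(self, backbone_feat, image):
        shallow = F.interpolate(self.shallow(image), size=backbone_feat.shape[-2:],
                                mode="bilinear", align_corners=False)
        fused = torch.cat([backbone_feat, shallow], dim=1)
        return self.shuffle(self.expand(fused))

# Example shapes: 64x64 backbone features and a 256x256 monochrome input image.
head = SRKDHead()
heatmaps = head(torch.rand(1, 32, 64, 64), torch.rand(1, 1, 256, 256))  # (1, 16, 128, 128)
```

With $r = 2$ a single shuffle doubles the heatmap resolution; how SRKDNet reaches its reported 256 × 256 output from the backbone resolution is not spelled out in this excerpt, so the wiring above is indicative only.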
2.2. Automatic Sample Generation Based on Virtual Platform

2.2.1. Virtual Platform Construction
The training of a neural network requires a large number of sample data. The predictive effect of the neural model is directly related to the quantity and quality of the sample data. To make the neural network fully "know" the robot arm to be detected in the image and make the trained model have a more stable performance, the sample images should be taken from various perspectives, under various backgrounds and lighting conditions. Obviously, collecting a large number of diverse sample data in real industrial scenarios will consume a lot of manpower and time.

For this paper, a virtual platform in UE4 [20] was established to simulate the working scene of the UR10 robot arm equipped with a working unit. The color, roughness, high brightness and metallicity of the appearance of the real robot arm are presented in the platform as far as possible. The base coordinate system of the UR10 robot arm is set as the world coordinate system. The robot arm in the zero position is shown in Figure 7a. A movable skeleton and a parent–child relationship between the adjacent bones are created according to the structure and kinetic characteristics of the robot arm, as shown in Figure 7b. Each bone has a head and a tail node. The head node is connected to the parent bone, and the tail node is connected to the child bone. Each bone can be rotated freely with the head node as a reference point, and the child bone connected to its tail node moves with it together. The 3D digital model of each arm segment is bound to the corresponding bone, so as to drive the articulated arm to move together with the skeleton, as shown in Figure 7c.

In UE4, the motion posture and speed of the robot arm can be easily set; one or more virtual cameras can be deployed in the scene; the internal and external parameters of the cameras, as well as the lighting conditions and the background of the scene, can be flexibly changed. In this way, we virtually collected a large number of sample data under various backgrounds and lighting conditions for training the SRKDNets. The background settings are randomly selected from the images in the COCO dataset [21]. A moving light source is used in the constructed virtual platform. The position and properties of the light source keep changing during the sample data collection. Rich background settings and lighting conditions in the sample data can make the neural network insensitive to the background/lighting changes and more focused on extracting the features of the robot arm itself.

After the virtual scene with the robot arm is established, the virtual cameras take virtual images of the scene to obtain the synthetic image of the robot arm. These virtual images will serve as the training samples. The attitude parameters of the robot arm, as well as the internal and external parameters of the virtual cameras corresponding to each virtual image, are recorded for image labeling, which will be detailed in the next subsection. Figure 8 shows three typical virtual sample images of the robot arm.

Figure 8. Three synthetic sample images of the robot arm against random backgrounds.
where ${}^0\widetilde{P}_j$ is the homogeneous form of ${}^0P_j$, ${}^m\widetilde{P}_j$ is the homogeneous form of ${}^mP_j$, and ${}^0_m T(\theta_1, \theta_2, \cdots, \theta_m) \in \mathbb{R}^{4 \times 4}$ is the transformation matrix from $C_m$ to $C_0$, which is determined by the rotation angles $\theta_1, \theta_2, \cdots, \theta_m$ of the $m$ joints. The coordinate values of ${}^mP_j$ do not change with the movement of the robot arm and can be determined in advance according to the digital model of the robot arm.

4. According to the internal and external parameters of the virtual camera, the pixel coordinates of each keypoint on the virtual image are calculated by using the camera imaging model in Formula (3):

$$ s_j \widetilde{p}_j = K \left[ R \;\; t \right] {}^0\widetilde{P}_j \tag{3} $$
When $L$ cameras are arranged to monitor the robot arm from different perspectives, then by combining Equations (2) and (3), we have the following:

$$ s_j \widetilde{p}_j^{\,l} = K_l \left[ R_l \;\; t_l \right] {}^0_m T(\theta_1, \theta_2, \cdots, \theta_m)\, {}^m\widetilde{P}_j, \quad j = 1, 2, \cdots, J; \; l = 1, 2, \cdots, L \tag{4} $$

where $\widetilde{p}_j^{\,l}$ denotes the homogeneous pixel coordinates of the keypoint $P_j$ in the $l$-th camera's image plane; $K_l, R_l, t_l$ represent the intrinsic and extrinsic parameters of the $l$-th camera; and $\theta_m$ is the rotation angle of the $m$-th joint. For all the keypoints in the multi-view images, Formula (4) forms an equation system composed of $L \times J$ equations.
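As a hedged illustration of how Equation (4) can be fed to a Levenberg–Marquardt solver, the sketch below projects the model keypoints for a candidate set of joint angles and stacks the reprojection errors over all views and keypoints. The forward-kinematics helper fk and the data layout are assumptions made for illustration; they are not the authors' code.

```python
import numpy as np
from scipy.optimize import least_squares

def project(K, R, t, X_base):
    """Project a 3D point given in the robot base frame to pixel coordinates."""
    uvw = K @ (R @ X_base + t)
    return uvw[:2] / uvw[2]                      # removes the scale factor s_j

def residuals(theta, cams, kps_model, detections, fk):
    """Reprojection errors of Equation (4). fk(theta, j) is assumed to return the
    4x4 transform from the frame of keypoint j's arm segment to the base frame."""
    res = []
    for l, (K, R, t) in enumerate(cams):         # l = 1..L cameras
        for j, P_m in enumerate(kps_model):      # j = 1..J model keypoints
            p_obs = detections[l][j]
            if p_obs is None:                    # keypoint screened out in this view
                continue
            P_base = (fk(theta, j) @ np.append(P_m, 1.0))[:3]
            res.extend(project(K, R, t, P_base) - p_obs)
    return np.asarray(res)

# theta0 holds random initial joint angles within their effective ranges;
# method="lm" selects a Levenberg-Marquardt optimization:
# sol = least_squares(residuals, theta0, args=(cams, kps_model, dets, fk), method="lm")
```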
In the robot arm attitude monitoring process, the image coordinates $p_j^l$ $(l = 1, 2, \cdots, L)$ of the keypoints $P_j$ $(j = 1, 2, \cdots, J)$ in the $L$ images are located via the proposed dual SRKDNets; the camera parameters $K_l, R_l, t_l$ are known in advance (in the virtual experiments, the camera parameters can be directly obtained from the settings; in the real experiments, the intrinsic parameters are determined with the popular calibration method presented in [23], and the relative geometry relationship between the UR10 robot arm and the cameras was calibrated in advance with the well-established off-line calibration methods presented in [24,25]); the 3D coordinates of ${}^mP_j$ are determined on the 3D digital model of the robot arm. Therefore, after removing the scale factor $s_j$, the unknowns in the equation system (4) are only the rotation angles of the joints. The LM (Levenberg–Marquardt) algorithm [26] can be used to optimize the equation system to obtain the joint angles $\theta_1, \theta_2, \cdots, \theta_m$. The initial values of $\theta_1, \theta_2, \cdots, \theta_m$ are randomly assigned within their effective ranges.

2.3.2. Keypoint Screening Based on Detection Confidence

Some keypoints, especially those on the first segment or on the flange of the robot arm, are prone to be occluded by other arm segments in certain perspectives, as shown in Figure 9. When a keypoint is blocked in the image, the detection reliability of the neural network will decline, and the error between the predicted position and the real position will be larger (see Section 3 for the experimental results). The accuracy decline of the keypoint detection will inevitably increase the attitude estimation error.

Figure 9. Three examples of robot arm with self-occlusion in real images.

However, in the case of monitoring with multi-view images, a keypoint is not likely to be occluded in all the images. Therefore, we propose a keypoint screening scheme, which is based on the detection confidence of the keypoint, to improve the attitude estimation accuracy.

As mentioned above, the value of each pixel in the heatmap output by the SRKDNet represents the probability that the image of the keypoint is located on that pixel. The pixel with the largest probability value (i.e., the detection confidence) in the heatmap will be selected as the detection result. For the $L$ images from different perspectives, each keypoint will have $L$ detection results, whose confidence values are different. The $L$ detection results are sorted from high to low according to their confidence values. Then, the results with low confidence scores are discarded, and at least two results with the highest scores are kept. The screened results with high detection quality are substituted into Formula (4) so that the attitude of the robot arm can be solved more accurately and reliably. It should be noted that with this screening scheme, the number of equations in (4) will be less than $L \times J$, but still far more than the number of unknowns. Therefore, it can ensure robust solutions.
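A minimal sketch of this screening rule, assuming each multi-view detection of a keypoint is stored as a (view index, pixel position, confidence) tuple; the confidence cut-off used to discard low-quality detections is a tunable assumption:

```python
def screen_keypoint(detections, min_keep=2, conf_cutoff=0.5):
    """Keep the most reliable multi-view detections of one keypoint for Equation (4).

    detections: list of (view_index, pixel_xy, confidence), one entry per camera.
    At least `min_keep` highest-confidence detections are always kept; lower-ranked
    detections survive only if their confidence exceeds `conf_cutoff`.
    """
    ranked = sorted(detections, key=lambda d: d[2], reverse=True)
    return ranked[:min_keep] + [d for d in ranked[min_keep:] if d[2] > conf_cutoff]
```

The virtual-data attitude experiments in Section 3 use a concrete variant of this rule for four views: the top three detections are kept, and the fourth only if its confidence exceeds 0.9.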
3. Experiments
3.1. Experiments on Virtual Data
The virtual sample acquisition and labeling methods described in Section 2.2 were
used to generate 11,000 labeled sample images with a resolution of 640 × 640. We randomly
selected 9000 virtual samples as the training set and 1000 virtual samples as the validation
set. The validation set was not included in the training and was only used to verify the
effect of the model after each round of training. The other 1000 virtual samples served
as the test set to demonstrate the final effect of the model after all rounds of training. All
the sample images in this study were monochrome. Before the sample images were put
into the convolutional neural network for training, they were reduced to the resolution of
256 × 256. All the experiments in this study were performed on a Dell workstation with an
RTX2080S graphics card and 8 GB video memory.
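For reference, the sample split and the pre-scaling described above amount to something like the following (file handling omitted; the random seed is arbitrary):

```python
import random
from torchvision import transforms

# 11,000 virtual samples -> 9000 training / 1000 validation / 1000 test, as in the text.
indices = list(range(11000))
random.seed(0)
random.shuffle(indices)
train_idx, val_idx, test_idx = indices[:9000], indices[9000:10000], indices[10000:]

# Monochrome 640x640 sample images are reduced to 256x256 before entering the network.
to_network_input = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])
```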
$$ \mathrm{MSE} = \frac{1}{N}\,\frac{1}{C}\,\frac{1}{H}\,\frac{1}{W} \sum_{i=0}^{N-1} \sum_{j=0}^{C-1} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} \big[ G_i(j, h, w) - H_i(j, h, w) \big]^2 \tag{5} $$
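Assuming, as the symbols suggest, that G denotes the ground-truth heatmaps and H the predicted ones, Equation (5) is the mean squared error averaged over the batch, channel and spatial dimensions and reduces to a one-liner in PyTorch:

```python
import torch
import torch.nn.functional as F

def heatmap_mse(pred, target):
    """Equation (5): mean over batch N, keypoint channels C and pixels H, W."""
    return F.mse_loss(pred, target)   # default 'mean' reduction divides by N*C*H*W

# Random tensors standing in for predicted / ground-truth heatmaps (16 keypoints assumed).
loss = heatmap_mse(torch.rand(8, 16, 256, 256), torch.rand(8, 16, 256, 256))
```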
SHNet [8], HRNet [11] and the proposed SRKDNet were trained with the generated
virtual data for the comparison of the keypoint detection performance among these models.
The PyTorch library was used to build and train the models. In the training of HRNet
and the proposed SRKDNet, the settings in Ref. [11] were adopted: Adam optimizer was
used; the initial learning rate was set to 0.001; the total training epoch was 45; the data
batch size was 8; the learning rate was reduced once every 15 rounds with a reduction
factor of 0.1. The weights of HRNet were obtained from the pre-trained HRNet on the
ImageNet [27] dataset. For the backbone network of the proposed SRKDNet, the same
initial weights and number of intermediate layers as in Ref. [11] were adopted. For SHNet,
two hourglass modules were stacked, and its training followed the settings in Ref. [8]: the
Rmsprop optimizer was used, the learning rate was initially set to 0.00025 and the neural
network was trained from scratch using PyTorch's default weight initialization.
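The quoted optimizer settings for HRNet and the SRKDNet correspond roughly to the configuration below; the stand-in model and random tensors only make the sketch self-contained and are not the actual network or data:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Conv2d(1, 16, kernel_size=3, padding=1)             # placeholder for SRKDNet
dataset = TensorDataset(torch.rand(32, 1, 256, 256), torch.rand(32, 16, 256, 256))
train_loader = DataLoader(dataset, batch_size=8, shuffle=True)  # batch size 8

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)       # initial learning rate 0.001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

for epoch in range(45):                                         # 45 training epochs
    for images, gt_heatmaps in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), gt_heatmaps)
        loss.backward()
        optimizer.step()
    scheduler.step()                                            # LR cut by 0.1 every 15 epochs
```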
The resolution of the heatmaps output by both SHNet and HRNet was 64 × 64. The
standard deviation of the Gaussian distribution of the weights on the corresponding ground-
truth heatmap was set to 1 pixel. The resolution of the heatmaps output by SRKDNet
was 256 × 256, and the standard deviation of the Gaussian distribution of the weights on
the corresponding ground-truth heatmap was set to 3 pixels. The channel number of the
highest-resolution branch of HRNet and SRKDNet was 32. The channel number of the
feature maps in the supplementary branch containing shallow image features in SRKDNet
was set to 8.
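The ground-truth heatmaps mentioned above place a 2D Gaussian at each annotated keypoint; a minimal sketch (the keypoint coordinates below are arbitrary examples):

```python
import numpy as np

def gaussian_heatmap(size, center, sigma):
    """Ground-truth heatmap: 2D Gaussian with std `sigma` (pixels) centred on the keypoint."""
    h, w = size
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

gt_64  = gaussian_heatmap((64, 64),   center=(20.5, 33.0),  sigma=1.0)  # SHNet/HRNet labels
gt_256 = gaussian_heatmap((256, 256), center=(82.0, 132.0), sigma=3.0)  # SRKDNet labels
```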
where A is the total number of predicted results; ei is the pixel distance between the
predicted and the ground-truth positions; enorm is the standard error distance; τ is a specified
threshold to adjust the ratio between the calculated error distance in the experiments and
enorm . If the calculated distance error ei between the predicted and the true positions of
the keypoint is less than enorm × τ , δ equals 1 and the predicted position is considered
correct. The keypoint prediction result of our full-view SRKDNet will be compared with
that of SHNet and HRNet by using PCK as the metric. In our experiment, enorm was
set to 40 pixels and τ was assigned as 0.2, 0.15 or 0.1. Considering that the three neural networks output heatmaps with different resolutions, but the detected keypoint positions need to be mapped back to the original sample images to conduct the subsequent robot arm attitude estimation, we mapped the predicted coordinates of all keypoints to the original resolution 640 × 640 for comparison. Table 1 lists the PCK values of the three methods, where [email protected], [email protected] and [email protected] represent the prediction accuracy with τ = 0.2, τ = 0.15 and τ = 0.1, respectively.
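Given this definition, PCK@τ can be computed directly from predicted and ground-truth pixel coordinates; a sketch with e_norm = 40 pixels and toy values:

```python
import numpy as np

def pck(pred, gt, tau, e_norm=40.0):
    """Fraction of keypoints whose pixel error is below e_norm * tau.
    pred, gt: arrays of shape (A, 2) with predicted / ground-truth coordinates."""
    errors = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(errors < e_norm * tau))

pred = np.array([[100.0, 52.0], [310.5, 221.0], [400.0, 90.0]])
gt   = np.array([[103.0, 50.0], [305.0, 220.0], [420.0, 95.0]])
scores = {t: pck(pred, gt, t) for t in (0.2, 0.15, 0.1)}   # e.g. {0.2: 0.667, ...}
```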
Table 1. Experimental results of keypoint detection on virtual samples.

Methods                  [email protected]    [email protected]   [email protected]
SHNet                    93.17%     87.27%     66.06%
HRNet                    94.91%     88.79%     68.14%
full-view SRKDNet        96.23%     94.33%     89.07%
The results in Table 1 show that the trained full-view SRKDNet completely outperforms the two comparison models SHNet and HRNet under all three threshold values. The smaller the threshold is, the more obvious the superiority of the full-view SRKDNet over the two comparison models is. The reasons for the superiority may lie in two aspects: (1) Using the heatmaps with a higher resolution (256 × 256) in the training labels can reduce the negative influence of the downsampling operation. (2) The predictive heatmap with the trained super-resolution layer can express the detected keypoints more accurately.

Figure 10 shows the detection results of the three keypoint detection networks SHNet, HRNet and our full-view SRKDNet for the same test image. The green dots are the real locations of the keypoints, and the blue dots are the predicted locations. The mean error refers to the average of the pixel distances between the predicted locations and the real locations of all the keypoints. The mean prediction error of full-view SRKDNet is significantly lower than that of SHNet and HRNet. We can also intuitively see that most keypoint locations predicted by the full-view SRKDNet are closer to the real location than those predicted by the comparison methods.

Figure 10. Comparison of keypoint detection results: (a) SHNet (mean error: 3.22 pixels); (b) HRNet (mean error: 2.92 pixels); (c) full-view SRKDNet (mean error: 1.48 pixels).
Figure 11. The influence of self-occlusion on keypoint detection. (Panel annotations: maximum confidence 0.116 and 0.774.)
The GPU (graphics processing unit) memory occupation of the full-view SRKDNet was also compared with that of SHNet and HRNet with the batch size set to 8 in the training. The result is shown in Table 2. The output heatmap resolution of the three convolutional neural networks is shown in parentheses.

Table 2. Comparison of GPU memory occupation (batch size = 8).

Neural Network Model              GPU Occupation
SHNet (64 × 64)                   3397 MB
HRNet (64 × 64)                   3031 MB
full-view SRKDNet (256 × 256)     3717 MB

Table 2 shows that HRNet occupies the least GPU memory during the training and outputs heatmaps with a resolution of only 64 × 64. The SRKDNet occupies 22.6% more GPU memory resources than HRNet. This demonstrates that the proposed full-view SRKDNet can remarkably improve the detection accuracy (see Table 1) at the expense of a mild increase in GPU occupation.
For further comparison, we canceled the downsampling operations in the preprocessing stage of HRNet so that HRNet can also output heatmaps with a 256 × 256 resolution, which is the same as that of the full-view SRKDNet. However, the maximum batch size of HRNet can only be set to 2 in this situation, and the experimental results are shown in Table 3. These results demonstrate that to enable HRNet to output heatmaps with the same resolution as that of the full-view SRKDNet, the GPU memory resource consumption increases sharply. When batch size = 2, the GPU occupation of HRNet exceeds that of our full-view SRKDNet by 277.03%.
Table 4 shows that the PCK score of dual SRKDNets is higher than that of the full-view
SRKDNet, which means that the use of the close-up SRKDNet can effectively improve
the keypoint detection accuracy. When a more stringent threshold is set, a more obvious
improvement can be achieved. When τ = 0.05, in other words, when the distance threshold
between the detected and the real keypoint positions was set to 2 pixels, the keypoint
detection accuracy increased from 62.14% to 93.92%. When the threshold was assigned as
0.1, the PCK score of the close-up SRKDNet increased to 97.66%, compared to 89.07% of
the full-view SRKDNet.
The above experimental results demonstrate that the proposed successive working
mechanism of the dual SRKDNets is quite effective. The close-up SRKDNet can further
improve the keypoint detection accuracy by a large margin.
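A rough sketch of the coarse-to-fine mechanism evaluated here: the full-view SRKDNet gives initial keypoint estimates, a region of interest around them is cropped and passed to the close-up SRKDNet, and the refined coordinates are mapped back to the full image. The cropping margin and the detector interfaces are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def coarse_to_fine(image, full_view_net, close_up_net, margin=0.15):
    """Two-stage detection: full-view pass, ROI crop, close-up refinement."""
    coarse = full_view_net(image)                 # (J, 2) rough keypoint pixels (x, y)
    x0, y0 = coarse.min(axis=0)
    x1, y1 = coarse.max(axis=0)
    pad_x, pad_y = margin * (x1 - x0), margin * (y1 - y0)
    x0, y0 = max(int(x0 - pad_x), 0), max(int(y0 - pad_y), 0)
    x1, y1 = min(int(x1 + pad_x), image.shape[1]), min(int(y1 + pad_y), image.shape[0])
    refined = close_up_net(image[y0:y1, x0:x1])   # (J, 2) keypoints in ROI coordinates
    return refined + np.array([x0, y0])           # back to full-image coordinates
```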
each of the 1000 sets of images were recorded as the ground-truth values of the 1000 attitude
estimation experiments. The full-view SRKDNet trained as described in Section 3.1.2 and
the close-up SRKDNet trained as described in Section 3.1.4 were used for the coarse-to-fine
keypoint detection.
The comparison experiments of single-view and multi-view attitude estimation, as well as the comparison of using and not using the confidence-based keypoint screening scheme, were conducted. The specific keypoint screening method for the four-perspective sample images adopted in the attitude estimation experiments was as follows: the detection results with the top three highest confidence values were kept; if the fourth detection result had a confidence score greater than 0.9, it would also be retained; otherwise, it would be discarded.
The average error of the estimated rotation angles of the 1000 experiments of each joint is shown in Table 5. "Single view" means the attitude estimation was performed based on the information from one single-perspective image (we randomly selected the 1000 sample images collected by the second camera); "four views" means that images from all four perspectives were used in the attitude estimation; "four views + confidence screening" means the multi-view keypoint screening scheme was utilized on the basis of "four views".

Table 5. Robot arm attitude estimation errors based on virtual images (unit: degree).
The above experimental results demonstrate that the average estimation error of the joint angles using images from four perspectives was reduced by 76.60% compared with that using the information from one perspective only. The confidence-based keypoint screening scheme further reduced the average error of the four-view attitude estimation by 46.23%. The compound accuracy increase reaches nearly an order of magnitude, which proves that the whole set of methods proposed in this paper is very effective.

3.2. Experiments on Real Robot Arm

3.2.1. Real Data Acquisition
The scene of a real robot arm attitude estimation experiment is shown in Figure 12, in which three cameras are distributed around a UR10 robot arm. The intrinsic parameters of the cameras and the transformation matrix of each camera coordinate system relative to the base coordinate system of the robot arm were calibrated in advance by using well-studied methods [23–25].

Figure 12. Experiment scene of real robot arm attitude estimation.
We planned 648 positions for the flange endpoint of the robot arm in its working space as the sample positions, as shown in Figure 13. Each sample position corresponded to a set of six joint angles. After the robot arm reached each sample position, the three cameras collected images synchronously and automatically recorded the current joint angles and the 3D coordinates of the center endpoint of the flange in the base coordinate system. This process was repeated until the real sample data collection was completed. A total of 1944 images were captured by the three real cameras.
The resolution of the industrial camera used in the experiment was 5120 × 5120. To facilitate the training and prediction of the keypoint detection networks, and to unify the experimental standards, the resolution of the collected real images was reduced to 640 × 640, which was the same as the resolution of the virtually synthesized images. The detected keypoint positions were mapped back to the initial images for the robot arm attitude estimation.
3.2.2. Keypoint Detection Experiment on Real Robot Arm
The real UR10 robot arm is consistent with the digital model in the virtual sampling platform. Therefore, all the settings in the experiments in Section 3.1, the geometric parameters of the robot arm, the 3D coordinates of the keypoints in the arm segment coordinate system and the kinematic model of the robot arm were also applied to the real robot arm attitude estimation experiments.

When the full-view SRKDNet trained using the virtual samples as described in Section 3.1.2 was used to detect the keypoints in the real images, its detection accuracy on real images was only 34.11%, 20.86% and 7.06% when the threshold τ was set to 0.15, 0.1 and 0.05, respectively. Therefore, we considered using the real sample data to fine-tune the trained model.

From the 1944 images (648 sets of triple-view real data) obtained in Section 3.2.1, 99 real sample data (33 sets) were randomly selected as training sets. Another randomly selected 99 real sample data (33 sets) served as the validation sets. The remaining 1746 samples (582 sets) were used as the test sets to evaluate the performance of keypoint detection and attitude estimation.

The full-view SRKDNet and the close-up SRKDNet pre-trained with virtual sample data were both fine-tuned with the training sets. For comparison, we also tried the method in which the full-view SRKDNet and the close-up SRKDNet were trained not with virtual sample data but directly with the real training sets. Since the number of real samples used for training was very small, the training epoch was set to 300. The learning rate was reduced once for every 100 epochs with a reduction coefficient of 0.1. The other settings were consistent with those in Section 3.1.1. The keypoint detection results of the 1746 real test samples with the model trained with these methods are shown in Table 6. The comparison of all the experimental results was still evaluated at a resolution of 640 × 640.
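The fine-tuning recipe above (start from the virtually pre-trained weights, 300 epochs on 99 real samples, learning rate cut by 0.1 every 100 epochs) corresponds roughly to the sketch below; the checkpoint name and the stand-in model/data are placeholders, not the released code:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Conv2d(1, 16, kernel_size=3, padding=1)                   # placeholder for SRKDNet
model.load_state_dict(torch.load("srkdnet_virtual_pretrained.pth"))  # assumed checkpoint name

real_loader = DataLoader(                                            # 99 real training samples
    TensorDataset(torch.rand(99, 1, 256, 256), torch.rand(99, 16, 256, 256)),
    batch_size=8, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)
for epoch in range(300):                                             # 300 fine-tuning epochs
    for images, gt_heatmaps in real_loader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(images), gt_heatmaps)
        loss.backward()
        optimizer.step()
    scheduler.step()
```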
The first row displays the result of applying only the full-view SRKDNet trained with
the 99 real sample data (training sets) to detect the keypoints. The second row displays
the result of applying only the full-view SRKDNet trained with our virtual datasets and
fine-tuned with the 99 real sample data (training sets) to detect the keypoints. The keypoint
detection results of dual SRKDNets trained with the 99 real sample data (training sets) are
shown in the third row. The last row displays the result of dual SRKDNets trained with
our virtual datasets and fine-tuned with the 99 real sample data (training sets). The initial
weights of the backbone network of these models were obtained from HRNet pre-trained
with the ImageNet dataset.
Table 6 shows that no matter whether full-view SRKDNet only or dual SRKDNets
were used for the keypoint detection, the model trained with the virtual samples generated
in Section 3.1.2 and fine-tuned with 99 real sample data (training sets) demonstrates much
better detection accuracy on the real test sets. When the full-view SRKDNet and close-up
SRKDNet were trained with no virtual data but only the 99 real sample data (training sets),
the keypoint detection accuracy values on the real test images were obviously lower. The
results of this experiment verify the following: (1) the virtual samples generated with the
proposed method in Section 2.2 have a significant positive effect on the keypoint detection
of the robot arm in real scenes; (2) small amounts of real samples can efficiently re-train the
model having been trained with virtual samples and achieve high generalization on real
robot arms.
An example of keypoint detection in a realistic scenario is shown in Figure 14, where
the first row shows the situation of using the full-view SRKDNet only, and the second row
shows the situation of using the full-view SRKDNet and the close-up SRKDNet. The first
column in Figure 14 shows the input real images, the second column shows the achieved
heatmaps and the third column illustrates the keypoint detection results.
Figure 14. Keypoint detection in realistic scenario.
Table 7. Experimental results of real robot arm attitude estimation (unit: degree).
The experimental results in Table 7 show that the whole set of methods proposed in this paper can achieve high-precision attitude estimation for the robot arm in realistic scenarios. The total average error over all six joints is only 0.53°, which is even slightly better than the average error of 0.57° obtained for the virtual test samples in Table 5. We briefly analyze the reasons. When sampling in the virtual environment, each joint angle and the camera poses were set at random. Therefore, a certain number of virtual samples are unrealistic, for example with interference between arm segments or arm segments positioned so close together that some keypoints are self-occluded in all views. It is hard to estimate the attitude of the robot arm from such virtual samples. In the real scenario, by contrast, only normal working attitudes appear in the sample data, with no extreme attitudes, for the sake of safety. This may explain why the attitude estimation accuracy using three-view information in real scenes is slightly better than that using four-view information in the virtual dataset.
The average joint angle estimation errors of a four-joint robot arm reported in Ref. [17]
and Ref. [18] are 4.81 degrees and 5.49 degrees, respectively, while the average joint angle estimation error for the six-joint UR10 robot arm with our method is only 0.53 degrees,
which is significantly lower. The reasons may lie in three aspects: (1) The SRKDNet
proposed in this paper learns the heatmaps with higher resolution by adding a subpixel
convolutional layer. In addition, the combination detection scheme based on the full-view
and close-up dual SRKDNets significantly improves the detection accuracy of the keypoints.
(2) The existing methods only use images from one single view, while our method uses
multi-view images, which can effectively alleviate the problems of self-occlusion and depth
ambiguity exhibited in single-view images. Moreover, the negative influence of improper
keypoint detection results can be greatly reduced by using the information redundancy
of the multi-view images and the confidence-based keypoint screening scheme. (3) The
existing methods not only estimate the joint angles of the robot arm but also estimate the
position and attitude of the robot arm relative to the camera, so there are 10 unknowns to
be estimated. Considering that the base of the robot arm is usually fixed in the industrial
scenes, we determine the relative geometric relationship between the robot arm and the
cameras through a well-established off-line calibration process to simplify the problem
to six unknowns. The above aspects of our method together contribute to the significant
accuracy improvements.
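Regarding aspect (1), the following sketch shows how a subpixel convolution (pixel-shuffle) layer can be appended to backbone feature maps to output heatmaps at a higher resolution without a heavy deconvolution stage. The channel counts, keypoint number and upscale factor are assumptions and not the exact SRKDNet configuration.

```python
import torch
import torch.nn as nn

class SubpixelHeatmapHead(nn.Module):
    """Upsamples backbone features into higher-resolution keypoint heatmaps
    via subpixel convolution (PixelShuffle). Channel sizes are illustrative."""

    def __init__(self, in_channels=32, num_keypoints=17, upscale=2):
        super().__init__()
        # A 1x1 convolution expands the channels so that PixelShuffle can
        # rearrange them into an (upscale x upscale) larger grid per keypoint.
        self.expand = nn.Conv2d(in_channels, num_keypoints * upscale ** 2, kernel_size=1)
        self.shuffle = nn.PixelShuffle(upscale)

    def forward(self, features):
        return self.shuffle(self.expand(features))

# Example: features of shape (1, 32, 160, 160) -> heatmaps of shape (1, 17, 320, 320).
head = SubpixelHeatmapHead()
print(head(torch.randn(1, 32, 160, 160)).shape)
```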
We also measured the time consumed in the two stages of keypoint detection and attitude estimation in the real scene, as shown in Table 8. The total time for keypoint detection in the three-view images using the dual SRKDNets was 0.28 s. The resolution of the sample images used for keypoint detection here was still 640 × 640. The time required for solving the joint angles of the real robot arm was 0.09 s.
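The attitude-solving stage essentially minimizes the reprojection error between the detected keypoints and the keypoints predicted by the camera imaging model applied to the robot arm forward kinematics, with low-confidence detections screened out. A heavily simplified sketch of such a solver is given below; the forward-kinematics and projection functions, the confidence threshold and the use of SciPy's Levenberg–Marquardt routine are assumptions made for illustration only.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(joint_angles, detections, project, forward_kinematics, conf_threshold=0.5):
    """Confidence-screened reprojection residuals over all views and keypoints.

    detections: per view, a tuple (keypoints_2d [K, 2], confidences [K]).
    project: hypothetical function mapping 3D points to pixels for a given view.
    forward_kinematics: hypothetical function returning the 3D keypoint positions
                        of the robot arm for the current joint angles.
    """
    points_3d = forward_kinematics(joint_angles)            # (K, 3)
    res = []
    for v, (kps_2d, conf) in enumerate(detections):
        proj = project(points_3d, view=v)                    # (K, 2)
        for k in range(len(conf)):
            if conf[k] < conf_threshold:                     # screen out unreliable keypoints
                continue
            res.extend(conf[k] * (proj[k] - kps_2d[k]))      # confidence-weighted error
    return np.asarray(res)

# Levenberg-Marquardt refinement from an initial guess; result.x would hold the
# six joint angles (project and forward_kinematics must be supplied by the user):
# result = least_squares(residuals, x0=np.zeros(6), method="lm",
#                        args=(detections, project, forward_kinematics))
```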
4. Conclusions
We have proposed a set of methods for accurately estimating the robot arm attitude
based on multi-view images. By incorporating a subpixel convolution layer into the back-
bone neural network, we put forward the SRKDNet to output high-resolution heatmaps
without significantly increasing the computational resource consumption. A virtual sample
generation platform and a keypoint detection mechanism based on dual SRKDNets were
proposed to improve the keypoint detection accuracy. The keypoint prediction accuracy
for the real robot arm is up to 96.07% for [email protected] (i.e., the position deviation between
the predicted and the real keypoints is within 6 pixels). An equation system, involving the
camera imaging model, the robot arm kinematic model, and the keypoints detected with
confidence values, was established and solved to finally obtain the rotation angles of the
joints. The confidence-based keypoint screening scheme makes full use of the information redundancy of the multi-view images and is proven to be effective in ensuring attitude estimation accuracy. Extensive experiments on virtual and real robot arm samples were conducted, and the results show that the proposed method can significantly improve the robot arm attitude estimation accuracy. The average estimation error of the joint angles of the real six-joint UR10 robot arm under three views is as low as 0.53 degrees, much lower than that of the comparison methods. The proposed method is therefore well suited to industrial applications with high-precision requirements for robot arm attitude estimation.
In the real triple-view monitoring scenario, a total of 0.37 s was required for the keypoint detection and attitude-solving stages. Keypoint detection accounted for most of this time, because our method detects keypoints in multi-view images with dual SRKDNets. The efficiency of the proposed method is therefore lower than that of single-view-based methods.
In this study, we only conducted experiments on one UR10 robot arm. In the future, we will try to extend our method to real industrial scenes with more types of robot arms.
Author Contributions: The work described in this article is the collaborative development of all
authors. Conceptualization, L.Z. (Liyan Zhang); methodology, L.Z. (Ling Zhou) and R.W.; software,
R.W.; validation, L.Z. (Ling Zhou) and R.W.; formal analysis, L.Z. (Ling Zhou); investigation, L.Z.
(Ling Zhou) and R.W.; resources, L.Z. (Liyan Zhang); data curation, R.W.; writing—original draft
preparation, L.Z. (Ling Zhou); writing—review and editing, R.W. and L.Z. (Liyan Zhang); visual-
ization, R.W. and L.Z. (Ling Zhou); supervision, L.Z. (Liyan Zhang); project administration, L.Z. (Liyan
Zhang); funding acquisition, L.Z. (Liyan Zhang). All authors have read and agreed to the published
version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China (Grant number 52075260) and the Key Research and Development Program of Jiangsu Province, China (Grant number BE2023086).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data are available from the corresponding author on reasonable
request.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Lin, L.; Yang, Y.; Song, Y.; Nemec, B.; Ude, A.; Rytz, J.A.; Buch, A.G.; Krüger, N.; Savarimuthu, T.R. Peg-in-Hole assembly under
uncertain pose estimation. In Proceedings of the 11th World Congress on Intelligent Control and Automation, Shenyang, China,
29 June–4 July 2014; pp. 2842–2847.
2. Smys, S.; Ranganathan, G. Robot assisted sensing, control and manufacture in automobile industry. J. ISMAC 2019, 1, 180–187.
3. Bu, L.; Chen, C.; Hu, G.; Sugirbay, A.; Sun, H.; Chen, J. Design and evaluation of a robotic apple harvester using optimized
picking patterns. Comput. Electron. Agric. 2022, 198, 107092. [CrossRef]
4. Lu, G.; Li, Y.; Jin, S.; Zheng, Y.; Chen, W.; Zheng, X. A realtime motion capture framework for synchronized neural decoding. In
Proceedings of the 2011 IEEE International Symposium on VR Innovation, Singapore, 19–20 March 2011. [CrossRef]
5. Verma, A.; Kofman, J.; Wu, X. Application of Markerless Image-Based Arm Tracking to Robot-Manipulator Teleoperation. In Proceedings of the First Canadian Conference on Computer and Robot Vision, London, ON, Canada, 17–19 May 2004; pp. 201–208.
6. Liang, C.J.; Lundeen, K.M.; McGee, W.; Menassa, C.C.; Lee, S.; Kamat, V.R. A vision-based marker-less pose estimation system for
articulated construction robots. Autom. Constr. 2019, 104, 80–94. [CrossRef]
7. Toshev, A.; Szegedy, C. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the 2014 IEEE Conference
on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660.
8. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference
on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 483–499.
9. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In
Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June
2018; pp. 7103–7112.
10. Peng, S.; Liu, Y.; Huang, Q.; Zhou, X.; Bao, H. PVNet: Pixel-wise voting network for 6DoF pose estimation. In Proceedings of the
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4556–4565.
11. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5686–5696.
12. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach.
Intell. 2016, 38, 295–307. [CrossRef] [PubMed]
13. Kim, J.; Lee, J.; Lee, K. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the 2016 IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654.
14. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings
of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017;
pp. 1132–1140.
15. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video
super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
16. Widmaier, F.; Kappler, D.; Schaal, S.; Bohg, J. Robot arm pose estimation by pixel-wise regression of joint angles. In Proceedings
of the 2016 IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; pp. 616–623.
17. Labbé, Y.; Carpentier, J.; Aubry, M.; Sivic, J. Single-view robot pose and joint angle estimation via render & compare. In
Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June
2021; pp. 1654–1663.
18. Zuo, Y.; Qiu, W.; Xie, L.; Zhong, F.; Wang, Y.; Yuille, A.L. CRAVES: Controlling robotic arm with a vision-based economic system.
In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20
June 2019; pp. 4209–4218.
19. Liu, Q.; Yang, D.; Hao, W.; Wei, Y. Research on Kinematic Modeling and Analysis Methods of UR Robot. In Proceedings of the
2018 IEEE 4th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 14–16 December
2018; pp. 159–164.
20. Sanders, A. An Introduction to Unreal Engine 4; CRC Press: Boca Raton, FL, USA; Taylor & Francis Group: Abingdon, UK, 2017.
21. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in
context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
22. Qiu, W.; Zhong, F.; Zhang, Y.; Qiao, S.; Xiao, Z.; Kim, T.S.; Wang, Y. UnrealCV: Virtual worlds for computer vision. In Proceedings
of the 2017 ACM, Tacoma, WA, USA, 18–20 August 2017; pp. 1221–1224.
23. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [CrossRef]
24. Strobl, K.; Hirzinger, G. Optimal hand-eye calibration. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent
Robots and Systems, Beijing, China, 9–13 October 2006; pp. 4647–4653.
25. Park, F.; Martin, B. Robot sensor calibration: Solving AX=XB on the Euclidean group. IEEE Trans. Robot. Autom. 1994, 10, 717–721. [CrossRef]
26. Levenberg, K. A method for the solution of certain problems in least squares. Quart. Appl. Math. 1944, 2, 164–168. [CrossRef]
27. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf.
Process. Syst. 2012, 25, 1097–1105. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.