
sensors

Article
Accurate Robot Arm Attitude Estimation Based on Multi-View
Images and Super-Resolution Keypoint Detection Networks
Ling Zhou, Ruilin Wang and Liyan Zhang *

College of Mechanical & Electrical Engineering, Nanjing University of Aeronautics and Astronautics,
Nanjing 210016, China; [email protected] (L.Z.); [email protected] (R.W.)
* Correspondence: [email protected]

Abstract: Robot arm monitoring is often required in intelligent industrial scenarios. A two-stage
method for robot arm attitude estimation based on multi-view images is proposed. In the first stage,
a super-resolution keypoint detection network (SRKDNet) is proposed. The SRKDNet incorporates
a subpixel convolution module in the backbone neural network, which can output high-resolution
heatmaps for keypoint detection without significantly increasing the computational resource consumption.
Efficient virtual and real sampling and SRKDNet training methods are put forward. The
SRKDNet is trained with generated virtual data and fine-tuned with real sample data. This method
decreases the time and manpower consumed in collecting data in real scenarios and achieves a better
generalization effect on real data. A coarse-to-fine dual-SRKDNet detection mechanism is proposed
and verified. Full-view and close-up dual SRKDNets are executed to first detect the keypoints and
then refine the results. The keypoint detection accuracy, PCK@0.2, for the real robot arm reaches
up to 96.07%. In the second stage, an equation system, involving the camera imaging model, the
robot arm kinematic model and keypoints with different confidence values, is established to solve the
unknown rotation angles of the joints. The proposed confidence-based keypoint screening scheme
makes full use of the information redundancy of multi-view images to ensure attitude estimation
accuracy. Experiments on a real UR10 robot arm under three views demonstrate that the average
estimation error of the joint angles is 0.53 degrees, which is superior to that achieved with the
comparison methods.
Keywords: robot arm; attitude estimation; super-resolution keypoint detection network (SRKDNet); multi-view images

Citation: Zhou, L.; Wang, R.; Zhang, L. Accurate Robot Arm Attitude Estimation Based on Multi-View Images and Super-Resolution Keypoint Detection Networks. Sensors 2024, 24, 305. https://doi.org/10.3390/s24010305

Academic Editors: Teng Huang, Qiong Wang and Yan Pang

Received: 19 November 2023; Revised: 24 December 2023; Accepted: 27 December 2023; Published: 4 January 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

In the context of intelligent manufacturing, robot arms with multiple joints play increasingly important roles in various industrial fields [1,2]. For instance, robot arms are utilized to accomplish automatic drilling, riveting and milling tasks in aerospace manufacturing; in automobile and traditional machinery manufacturing fields, robot arms can be frequently seen in automatic loading/unloading, automatic measurement and other production or assembly tasks.

In most industrial applications, a robot arm works in accordance with the pre-planned program. However, on occasions where the robot arm becomes out of control by mistake, serious collision or injury accidents may occur, especially in the work context of human–machine cooperation. Therefore, it is critical to configure monitoring means to ensure safety. On-site attitude monitoring of working robot arms is also essential for the collaborative work of multiple robot arms.

Machine vision is one of the most suitable and widely used monitoring means due to its relatively low cost, high applicability and good accuracy. To reduce the difficulty of image feature recognition and to improve monitoring accuracy and reliability, a common method in industry is to arrange cooperative visual targets on the monitored object [3,4].
However, arranging cooperative targets is usually cumbersome and time-consuming, and
the targets in industrial sites tend to suffer from being stained or falling off. Studying
methods for accurately estimating the attitude of robot arms without relying on cooperative
visual markers presents a significant research challenge [5,6].
With their excellent ability to extract image features, deep neural networks have been
widely used in the field of computer vision. They can extract natural feature information
and deep semantic information from images and realize various computer vision tasks
based on the rich extracted information without relying on cooperative visual markers.
The base of the robot arm in most industrial scenes is fixed on the ground or a workbench.
In this situation, the motion attitude of the robot arm is completely determined by the
rotation angle of each joint. Therefore, monitoring the robot arm attitude essentially amounts to
determining the rotation angles of the arm joints. One approach to this is constructing an
end-to-end neural network model to directly predict the attitude parameters of the robot
arm through the input image of the robot arm. However, the end-to-end method requires
more computing resources. In addition, it is not easy to make full use of the kinematic
constraints of the robot arm and the imaging constraints from 2D to 3D space. Therefore,
the attitude estimation accuracy of the end-to-end method is difficult to ensure. Another
possible approach to attitude estimation is composed of two stages. First, the feature points
of the robot arm are detected in the image, and then system equations are established to
solve the angle of each joint. This strategy can better leverage the advantages of deep
learning and 3D machine vision theory.
Keypoint detection is a major application of deep learning methods. Toshev et al. [7]
directly located the image coordinates of the keypoints on the human body through convo-
lutional neural networks to determine the human pose. Instead of outputting determined
positions for the detected keypoints, the subsequent keypoint detection networks com-
monly output the positions in the form of heatmaps [8–11]. Newell et al. [8] proposed
SHNet (Stacked Hourglass Network), which stacked several hourglass modules and de-
tected keypoints based on multi-scale image feature information. Chen et al. [9] proposed
CPNet (Cascaded Pyramid Network), which is cascaded by two convolutional neural
network modules. The first module is used to detect all keypoints, and the second module
corrects the poor-quality results detected by the first module to improve the final detec-
tion accuracy. Peng et al. [10] proposed PVNet (Pixel-wise Voting Network), which can
obtain superior keypoint detection results when the target object is partially blocked. Sun
et al. [11] proposed HRNet (High-Resolution Network), which processes feature maps
under multiple-resolution branches in parallel, so that the network can maintain relatively
high-resolution representation in forward propagation. These neural networks have found
successful applications in tasks such as human body posture estimation, where qualitative
understanding instead of quantitative accuracy is the main concern. Due to the huge
demand for computing resources for network training, these neural networks have to resort
to a downsampling process for front-end feature extraction, which leads to the resolution
of the output heatmaps being insufficient for high-accuracy estimation.
Determining how to make the neural network output a higher-resolution heatmap
without significantly increasing the consumption of computing resources is a significant
problem worth investigating. Super-resolution image recovery based on deep learning has
seen great progress in recent years. SRCNNet (Super-Resolution Convolutional Neural
Network) [12], VDSRNet (Very Deep Super-Resolution Network) [13], EDSRNet (Enhanced
Deep Super-Resolution Network) [14], etc., have been proposed. The early super-resolution
reconstruction networks need to upsample a low-resolution input image to the target resolu-
tion for subsequent processing before training and prediction; therefore, the computational
complexity is high. Shi et al. [15] proposed ESPCNNet (Efficient Subpixel Convolutional
Neural Network) from the perspective of reducing computational complexity. This convo-
lutional neural network deals with low-resolution feature maps in the training process and
only adds a subpixel convolution layer to realize upsampling operation in the end, which
effectively increases the speed of super-resolution image reconstruction. It has the potential
to improve the resolution of the output keypoint heatmaps and in turn improve the key-
point positioning accuracy by introducing the idea of super-resolution image reconstruction
into the keypoint detection network.
To monitor the attitude of a robot arm, it is essential to solve the rotation angle of
each joint. Based on the depth image of the robot arm, Widmaier et al. [16] used a random
forest regression model to estimate the robot arm attitude. Labbe et al. [17] and Zuo
et al. [18] estimated the robot arm attitude based on one single grayscale image. However,
serious joint occlusion is inevitable in one single-perspective image, which makes it hard
to detect some keypoints and may even lead to wrong estimation results. Moreover, the
depth ambiguity problem in monocular vision may lead to multiple solutions in attitude
estimation, reducing the monitoring reliability of the robot arm.
In this paper, we present a two-stage high-precision attitude estimation method for
base-fixed six-joint robot arms based on multi-view images. The contributions include the
following: (1) A new super-resolution keypoint detection network (SRKDNet for short)
is proposed. The novelty of the SRKDNet lies in incorporating a subpixel convolution module
in the backbone neural network HRNet [11] to learn the law of resolution
recovery of the downsampled feature maps. This method can alleviate the disadvantages of
low-resolution heatmaps and improve the keypoint detection accuracy without significantly
increasing the computing resource consumption. (2) A coarse-to-fine detection mechanism
based on dual SRKDNets is put forward. A full-view SRKDNet obtains a relatively rough
keypoint detection result. Then, a close-up SRKDNet is executed to refine the results with
a cropped image of the ROI determined by the results of the full-view SRKDNet. The
dual-SRKDNet detection mechanism performs better than one-time detection, and the
keypoint detection accuracy is drastically improved. (3) Efficient virtual-and-real sampling
and neural network training methods are proposed and verified. The virtual sample data
are first used to train the neural network, and then a small number of real data are applied
to fine-tune the model. This method achieves accurate keypoint detection for real data
without consuming a huge amount of time and manpower. (4) The constraint equations for
solving the rotation angles of each joint are established; they depict the relation among the
detected keypoints in the multi-view images, the camera imaging model and the kinematic
model of the robot arm. A screening strategy based on the keypoint detection confidence
is incorporated in the solving process and is proved to be critical for ensuring attitude
estimation accuracy. Experiments demonstrate that the whole set of methods proposed in
this paper can realize high-accuracy estimation of robotic arm attitude without utilizing
cooperative visual markers.
The remaining contents of this paper are arranged as follows: In Section 2, we in-
troduce the whole set of methods, including the approaches to high-precision keypoint
detection (Section 2.1), automatic virtual sample generation (Section 2.2) and robot arm
attitude estimation (Section 2.3). Experiments on virtual and real robot arms are reported
in Section 3. We conclude the paper in Section 4.

2. Materials and Methods


2.1. High-Precision Detection of Keypoints
The first step of the two-stage attitude estimation method proposed in this paper is
to detect the preset keypoints on/in the robot arm in the images. The detection accuracy
directly affects the accuracy of the joint angle estimation.
The preset keypoints were selected under three basic criteria: (1) The keypoints should
have distinctive features to be identified in the images. (2) The keypoints should be helpful
for determining the attitude of the robot arm. (3) There are at least 2 keypoints on each rod.
Taking the commonly used UR10 robot arm as the research object, we selected 20 keypoints
on/in the robot arm (including the working unit attached at the end). The keypoint set was
composed of the 3D center points (the red points) of each rotating joint, the 3D midpoint
(the blue points) of each rod segment and some salient feature points (the green points) of
the arm, as shown in Figure 1. The first two types of points are inside the structure, and the
third type of points is on the surface of the structure. The keypoints form a skeleton, which
can effectively characterize the attitude of the entire robot arm.

Figure 1. The selected keypoints on/in the robot arm.

Each keypoint was directly selected on the 3D digital model of the robot arm when it was in the zero position. To generate the sample data for neural network training, either a virtual or real robot arm was controlled to move to the specified positions. The 3D coordinates of the preset keypoints at each specified position (for instance, the position as in Figure 1) could be obtained according to the kinematics of robotic arms [19], which will be detailed in Section 2.2.2.

Given any predefined keypoint position in/on the digital model, its corresponding image point can be calculated according to the camera model. In this way, we obtained a large number of training samples for the keypoint detection network. Experiments show that with the predefined keypoints, the keypoint detection network works well and the arm attitude estimation achieves high accuracy.

This section begins with a brief introduction to the backbone network HRNet [11] used in this paper. Then, we introduce the idea of image super-resolution reconstruction into keypoint detection and propose a super-resolution keypoint detection network, SRKDNet, which can alleviate the disadvantages of the low-resolution heatmaps without significantly increasing the computing resource consumption. A coarse-to-fine keypoint detection scheme based on dual SRKDNets is also presented in this section.
2.1.1. Brief Introduction to HRNet

To retain high resolution for the feature maps in the forward propagation, the high-resolution network HRNet processes feature maps at various resolutions, as shown in Figure 2. First, a preprocessing module is used to downsample the input image, which lowers the resolution of the output heatmaps as well. The main structure of HRNet can be divided into several stages, and the branches at different resolutions in each stage use residual blocks to extract features. After each stage, a new branch is created without abandoning the original-resolution branch. The new branch is obtained by strided convolutions. The length and width of the new feature map are reduced to 1/2 of the original, but the number of channels becomes twice that of the original. In the new stage, the feature maps are created by fusing the multi-scale feature maps of each branch in the previous stage. The HRNet shown in Figure 2 has four branches with different resolutions. The final output feature map integrates the information extracted from the four branches and is used for generating the keypoint heatmap.

Figure 2. Network structure of HRNet.

The multi-scale fusion operation in HRNet is shown in Figure 3. The feature maps from the branches with the same resolution remain the same, while those with different resolutions are converted to the same resolution first via upsampling or downsampling (strided convolutions). Then, they are aggregated to obtain the output maps. HRNet has powerful multi-scale image feature extraction capability and has been widely used in classification recognition, semantic segmentation and object detection. We take HRNet as the backbone in our super-resolution keypoint detection network (SRKDNet), which will be presented in the next subsection.

Figure 3. Multi-scale fusion method of HRNet.
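For illustration, the fusion of two neighbouring branches can be sketched in PyTorch as follows. This is only a schematic reading of Figure 3, not the exact HRNet implementation: the channel counts, the nearest-neighbour upsampling and the omission of batch normalization and activations are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFusion(nn.Module):
    """Illustrative fusion of a high-resolution and a half-resolution branch,
    in the spirit of HRNet: resolutions are matched first, then the maps are summed."""
    def __init__(self, ch_high=32, ch_low=64):
        super().__init__()
        # low -> high: 1x1 conv to match channels, then nearest-neighbour upsampling
        self.low_to_high = nn.Conv2d(ch_low, ch_high, kernel_size=1)
        # high -> low: strided 3x3 conv halves the resolution and doubles the channels
        self.high_to_low = nn.Conv2d(ch_high, ch_low, kernel_size=3, stride=2, padding=1)

    def forward(self, x_high, x_low):
        up = F.interpolate(self.low_to_high(x_low), size=x_high.shape[-2:], mode="nearest")
        down = self.high_to_low(x_high)
        return x_high + up, x_low + down  # fused high- and low-resolution outputs

# Shape check: 64x64 high-resolution branch and 32x32 low-resolution branch.
fuse = TwoBranchFusion()
y_high, y_low = fuse(torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32))
print(y_high.shape, y_low.shape)  # torch.Size([1, 32, 64, 64]) torch.Size([1, 64, 32, 32])
```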
2.1.2. Super-Resolution Keypoint Detection Network (SRKDNet)


If the resolution of the sample images is large, the neural networks will consume huge amounts of computing resources in the training. Therefore, HRNet and most related neural networks like SHNet [8] utilize a preprocessing convolution module to downsample the input images. Taking HRNet for instance, the original images need to be downsampled to 64 × 64 through the preprocessing convolution module. The so-called high-resolution feature map maintained in each subsequent layer of the convolutional neural network structure is only 64 × 64. If an ordinary upsampling method, e.g., neighbor interpolation, is used to restore the resolution, the downsampling operation will lead to precision loss of the keypoint positions in the heatmaps. Suppose the coordinates of a keypoint in the original image are (u, v), and the downsampling scale is a; then, the coordinates of the keypoint on the corresponding ground-truth heatmap are (u/a, v/a), which have to be rounded as ([u/a], [v/a]). Here, [·] represents the round operator. When the keypoint coordinates ([u/a], [v/a]) are re-mapped back to the original resolution denoted as (u′, v′), a maximum error is possibly generated:

$$ \varepsilon_{\max} = \sqrt{(u - u')^2 + (v - v')^2} = \frac{\sqrt{2}}{2}\,a \qquad (1) $$

It can be seen that the larger the downsampling scale is, the more possible accuracy loss of the keypoint position in the heatmap there will be.
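For example, with the downsampling scale a = 4 that maps a 256 × 256 input image to a 64 × 64 heatmap, the maximum re-mapping error given by Equation (1) is ε_max = (√2/2) × 4 = 2√2 ≈ 2.83 pixels in the original image.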
Inspired by the methods of image super-resolution reconstruction, we propose a super-resolution keypoint detection network (SRKDNet). As shown in Figure 4, SRKDNet uses HRNet as the backbone network to extract multi-scale feature information in the images. The preprocessing module extracts shallow features from the input image and downsamples it. The generated low-resolution feature map is then sent to the core module of HRNet. Instead of directly sending the feature maps output by the core module to the 1 × 1 convolution layer for heatmap generation, SRKDNet incorporates a subpixel convolution layer [15] after the core module to learn the law of resolution recovery of the downsampled feature maps. In addition, a branch is added to the input side, and the resolution of the output feature map of the branch is the same as that of the input image (also called the original resolution). The branch is combined with the resolution-recovered feature map output by the resolution recovery module through channel fusion processing. Finally, a 1 × 1 convolutional layer is used for generating the heatmap with the original resolution.
resolution.

Figure 4. Structure of the proposed SRKDNet.

Adding a branch at the input side can provide additional shallow image features for the final heatmap generation. We believe that the combination of shallow and deep feature information is conducive to keypoint detection. In order to avoid the loss of image information, the branch consists of only one convolution layer and one batch normalization layer, with no activation layer involved.
The essence of subpixel convolution is to rearrange the pixels from a low-resolution image with more channels in a specific way to form a high-resolution image with fewer channels. As shown in Figure 5, pixels of different channels at the same position on the low-resolution image with size (r², H, W) are extracted and composed into small squares with size (r, r), which together form a high-resolution image with size (1, rH, rW). Realizing the subpixel convolution requires convoluting the low-resolution feature maps first to expand the number of image channels. For instance, the low-resolution image in size of (C, H, W) needs to be expanded to size (r²C, H, W) via convolution before it can be converted to a high-resolution image in size of (C, rH, rW). In our implementation, r = 2.

Figure 5. Subpixel convolution processing.
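To make the structure in Figures 4 and 5 concrete, the following PyTorch sketch assembles a super-resolution head in the spirit of SRKDNet. It is an illustrative reconstruction rather than the authors' released code: the channel counts (32 backbone channels, 8 shallow-branch channels, 20 keypoint heatmaps) follow Section 3.1.1, while the single PixelShuffle with r = 4 bridging 64 × 64 to 256 × 256 and the fusion by channel concatenation are assumptions (the paper reports r = 2 for its subpixel layer).

```python
import torch
import torch.nn as nn

class SRHead(nn.Module):
    """Sketch of the SRKDNet output stage: a subpixel-convolution (PixelShuffle) module
    restores the resolution of the backbone features, a shallow input branch is fused in
    via channel concatenation, and a 1x1 convolution produces the keypoint heatmaps."""
    def __init__(self, backbone_ch=32, shallow_ch=8, num_keypoints=20, r=4):
        super().__init__()
        # expand channels so that PixelShuffle(r) can rearrange them into an r-times larger map
        self.expand = nn.Conv2d(backbone_ch, backbone_ch * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)
        # shallow branch at the input resolution: one conv + one batch norm, no activation
        self.shallow = nn.Sequential(nn.Conv2d(1, shallow_ch, kernel_size=3, padding=1),
                                     nn.BatchNorm2d(shallow_ch))
        self.head = nn.Conv2d(backbone_ch + shallow_ch, num_keypoints, kernel_size=1)

    def forward(self, image, backbone_feat):
        up = self.shuffle(self.expand(backbone_feat))   # (N, 32, 256, 256)
        sh = self.shallow(image)                        # (N, 8, 256, 256)
        return self.head(torch.cat([up, sh], dim=1))    # (N, 20, 256, 256)

# 256x256 monochrome input, 64x64 backbone feature map -> 256x256 heatmaps.
head = SRHead()
out = head(torch.randn(2, 1, 256, 256), torch.randn(2, 32, 64, 64))
print(out.shape)  # torch.Size([2, 20, 256, 256])
```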

Since the backbone network HRNet has a strong ability to detect features in multiple scales, SRKDNet does not adopt a complex super-resolution neural network structure. Instead, only a subpixel convolution module is applied to enable the convolutional neural network to learn the knowledge for generating high-quality and high-resolution heatmaps from low-resolution information. The subsequent experiments demonstrate its significant effect in improving the neural network performance.

2.1.3. Coarse-to-Fine Detection Based on Dual SRKDNets


For collecting the sample data of the robot arm, it is necessary to make the field of view
of the camera completely cover the working space of the robot arm, which is much larger
than the robot arm itself. Therefore, the robot arm occupies only a small region in some
images, while the large background regions are of no help to the keypoint detection. To
further improve the detection performance, we propose a coarse-to-fine detection strategy
based on dual SRKDNets. First, an SRKDNet, namely a full-view SRKDNet, is trained by
using the original sample images, as shown in Figure 6a. By using the trained full-view
SRKDNet, the coarse keypoint detection results are obtained (red points in Figure 6b).
Based on the relatively rough detection results, the corresponding region of interest (ROI)
of the robot arm in each image (blue bounding box) is determined, as shown in Figure 6b.
According to the ROI, a new image is cropped from the sample image, as shown in Figure 6c.
Using the cropped new images as the sample data, another SRKDNet, namely close-up
SRKDNet, is trained. The two SRKDNets use the same convolutional neural network
structure, the same training flow and the same setup. For an image to be detected, the
trained full-view SRKDNet is first used for rough detection. The cropped image of the
robot arm ROI is then put into the trained close-up SRKDNet to obtain the final keypoint
detection result, shown as the blue points in Figure 6d. Our experiments have demonstrated
that the detection scheme based on dual SRKDNets can drastically improve the keypoint
detection performance. The details of the experiments can be found in Section 3.

Figure 6. Coarse-to-fine keypoint detection based on dual SRKDNets. (a) Original sample image. (b) ROI of the robot arm. (c) Cropped image of the robot arm ROI. (d) Keypoint detection result.
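The coarse-to-fine procedure can be summarized by the following sketch. The two network objects are placeholders, and the ROI padding, the argmax decoding and the assumption that each SRKDNet returns heatmaps at the resolution of its input image are illustrative choices, not details taken from the paper.

```python
import numpy as np

def decode_keypoints(heatmaps):
    """Take the argmax of each heatmap channel as the keypoint position (x, y) plus its confidence."""
    pts = []
    for hm in heatmaps:                       # heatmaps: (J, H, W)
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        pts.append((x, y, hm[y, x]))
    return pts

def coarse_to_fine(image, full_view_net, close_up_net, pad=30):
    """Two-stage detection: full-view pass -> ROI crop -> close-up pass -> map back."""
    coarse = decode_keypoints(full_view_net(image))
    xs, ys = [p[0] for p in coarse], [p[1] for p in coarse]
    x0, y0 = max(min(xs) - pad, 0), max(min(ys) - pad, 0)   # ROI from the coarse points
    x1 = min(max(xs) + pad, image.shape[1])
    y1 = min(max(ys) + pad, image.shape[0])
    crop = image[y0:y1, x0:x1]
    fine = decode_keypoints(close_up_net(crop))
    return [(x + x0, y + y0, c) for (x, y, c) in fine]      # back to full-image coordinates
```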

2.2. Automatic Sample Generation Based on Virtual Platform

2.2.1. Virtual Platform Construction

The training of a neural network requires a large number of sample data. The predictive effect of the neural model is directly related to the quantity and quality of the sample data. To make the neural network fully “know” the robot arm to be detected in the image and make the trained model have a more stable performance, the sample images should be taken from various perspectives, under various backgrounds and lighting conditions. Obviously, collecting a large number of diverse sample data in real industrial scenarios will consume a lot of manpower and time.

For this paper, a virtual platform in UE4 [20] was established to simulate the working scene of the UR10 robot arm equipped with a working unit. The color, roughness, high brightness and metallicity of the appearance of the real robot arm are presented in the platform as far as possible. The base coordinate system of the UR10 robot arm is set as the world coordinate system. The robot arm in the zero position is shown in Figure 7a. A movable skeleton and a parent–child relationship between the adjacent bones are created according to the structure and kinetic characteristics of the robot arm, as shown in Figure 7b. Each bone has a head and a tail node. The head node is connected to the parent bone, and the tail node is connected to the child bone. Each bone can be rotated freely with the head node as a reference point, and the child bone connected to its tail node moves together with it. The 3D digital model of each arm segment is bound to the corresponding bone, so as to drive the articulated arm to move together with the skeleton, as shown in Figure 7c.
Figure 7. Binding of robot arm and skeleton. (a) Robot arm in zero position. (b) Movable skeleton of the arm. (c) Robot arm moving with the skeleton.

In UE4, the motion posture and speed of the robot arm can be easily set; one or more virtual cameras can be deployed in the scene; the internal and external parameters of the cameras, as well as the lighting conditions and the background of the scene, can be flexibly changed. In this way, we virtually collected a large number of sample data under various backgrounds and lighting conditions for training the SRKDNets. The background settings are randomly selected from the images in the COCO dataset [21]. A moving light source is used in the constructed virtual platform. The position and properties of the light source keep changing during the sample data collection. Rich background settings and lighting conditions in the sample data can make the neural network insensitive to the background/lighting changes and more focused on extracting the features of the robot arm itself.

After the virtual scene with the robot arm is established, the virtual cameras take virtual images of the scene to obtain the synthetic image of the robot arm. These virtual images will serve as the training samples. The attitude parameters of the robot arm, as well as the internal and external parameters of the virtual cameras corresponding to each virtual image, are recorded for image labeling, which will be detailed in the next subsection. Figure 8 shows three typical virtual sample images of the robot arm.

Figure 8. Three synthetic sample images of the robot arm against random backgrounds.

2.2.2. Automated Generation and Labeling of Virtual Samples


With the virtual platform constructed in Section 2.2.1, we generated a large number of
virtual sample images using a Python program, utilizing the UnrealCV library [22] and the
blueprint script in UE4. The collection process of a single sample is as follows:
1. The rotation angles of each joint of the robot arm, the pose parameters of the virtual
camera, the light intensity and direction, and the background are randomly generated.
They are used to automatically update the virtual sample collecting scene in UE4.
2. A virtual image of the current virtual scene is taken via the virtual camera.
3. For any keypoint $P_j$ on/in the m-th arm segment, its relative coordinate ${}^{m}P_j$ in the arm segment coordinate system $C_m$ is converted to ${}^{0}P_j$ in the robot base coordinate system $C_0$ according to the successive parent–child transformation relationship of the bone segments:

$$ {}^{0}\tilde{P}_j = {}^{0}_{m}T(\theta_1, \theta_2, \cdots, \theta_m)\,{}^{m}\tilde{P}_j \qquad (2) $$

where ${}^{0}\tilde{P}_j$ is the homogeneous form of ${}^{0}P_j$, ${}^{m}\tilde{P}_j$ is the homogeneous form of ${}^{m}P_j$ and ${}^{0}_{m}T(\theta_1, \theta_2, \cdots, \theta_m) \in \mathbb{R}^{4\times 4}$ is the transformation matrix from $C_m$ to $C_0$, which is determined by the rotation angles $\theta_1, \theta_2, \cdots, \theta_m$ of the m joints. The coordinate values of ${}^{m}P_j$ do not change with the movement of the robot arm and can be determined in advance according to the digital model of the robot arm.
4. According to the internal and external parameters of the virtual camera, the pixel
coordinates of each keypoint on the virtual image are calculated by using the camera
imaging model in Formula (3)

$$ s\,\tilde{p}_j = K\,[R\ \ t]\,{}^{0}\tilde{P}_j \qquad (3) $$

where $\tilde{p}_j$ represents the homogeneous pixel coordinate vector of keypoint $P_j$ in the virtual image, ${}^{0}\tilde{P}_j$ is the homogeneous form of the 3D coordinates in the robot base coordinate system $C_0$, $R$ and $t$ are the rotation matrix and translation vector from $C_0$ to the virtual camera coordinate system $C_c$, $K$ is the intrinsic parameter matrix of the camera and $s$ is a scalar coefficient.
5. According to the pixel coordinates p j of each keypoint Pj , the heatmap label of the
current virtual image is generated. In the generated heatmap of Pj , the weight is set to
the largest value at p j and gradually decreases around p j with Gaussian distribution.
All the heatmaps of $P_j$ ($j = 1, 2, \cdots, J$) are concatenated as the training label of the
current virtual sample (a minimal sketch of this labeling step is given right after this list).
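As a concrete illustration of step 5, a Gaussian heatmap label can be generated as in the following sketch. The heatmap size of 256 × 256 and the standard deviation of 3 pixels match the SRKDNet settings in Section 3.1.1; the function names and the example coordinates are illustrative.

```python
import numpy as np

def gaussian_heatmap(u, v, height, width, sigma=3.0):
    """Heatmap with the largest weight at pixel (u, v), decaying around it with a Gaussian."""
    xs = np.arange(width)
    ys = np.arange(height)[:, None]
    return np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))

def make_label(keypoints_px, height=256, width=256, sigma=3.0):
    """Stack one heatmap per keypoint; shape (J, H, W), the training label of one sample."""
    return np.stack([gaussian_heatmap(u, v, height, width, sigma)
                     for (u, v) in keypoints_px])

# Example: 20 keypoints projected into a 256 x 256 image (coordinates are made up).
label = make_label([(120.5, 80.2)] * 20)
print(label.shape)  # (20, 256, 256)
```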
In addition, data enhancements, including random image rotation, translation, scaling
and gray changes, are also carried out on the generated sample images to improve the
robustness and generalization of the model.
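These augmentations can be realized, for example, with standard torchvision transforms. The parameter ranges below are illustrative choices rather than the values used in the paper, and in practice the same geometric transform must also be applied to the heatmap labels.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline for the monochrome sample images.
augment = T.Compose([
    T.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.8, 1.2)),  # rotation, translation, scaling
    T.ColorJitter(brightness=0.3, contrast=0.3),                          # gray-level changes
])
```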
Using the above process, we automatically collected a large number of virtual sample
data with training labels.

2.3. Attitude Estimation Based on Multi-View Images


In view of the inevitability of the keypoint occlusions and the lack of depth constraints
in a single-view image, we used multi-view images combined with the proposed keypoint
learning algorithm to estimate the attitude of the robot arm and verify its performance on
accuracy and reliability.

2.3.1. Solving Rotation Angles of Robot Arm Joints


As stated in Section 2.2.2, for the keypoint $P_j$ on the m-th arm segment, its 3D coordinate ${}^{0}P_j$ in the base coordinate system $C_0$ under any robot attitude can be calculated from the 3D coordinate ${}^{m}P_j$ in $C_m$ according to Equation (2). Suppose that L cameras
are arranged to monitor the robot arm from different perspectives, then by combining
Equations (2) and (3), we have the following:
$$ s_j\,\tilde{p}_j^{\,l} = K^l\,[R^l\ \ t^l]\,{}^{0}_{m}T(\theta_1, \theta_2, \cdots, \theta_m)\,{}^{m}\tilde{P}_j, \quad j = 1, 2, \cdots, J;\ l = 1, 2, \cdots, L \qquad (4) $$

where $\tilde{p}_j^{\,l}$ denotes the homogeneous pixel coordinates of the keypoint $P_j$ in the l-th camera's
image plane; $K^l$, $R^l$, $t^l$ represent the intrinsic and extrinsic parameters of the l-th camera;
and θm is the rotation angle of the m-th joint. For all the keypoints in the multi-view images,
Formula (4) forms an equation system composed of L × J equations.
In the robot arm attitude monitoring process, the image coordinates $p_j^l$ (l = 1, 2, · · · , L)
of the keypoints Pj (j = 1, 2, · · · , J) in the L images are located via the proposed dual SRKD-
Nets; the camera parameters Kl , Rl , tl are known in advance (In the virtual experiments,
the camera parameters can be directly obtained from the settings. In the real experiments,
the intrinsic parameters are determined with the popular calibration method presented
in [23]. The relative geometry relationship between the UR10 robot arm and the cameras
was calibrated in advance with the well-established off-line calibration method presented
in [24,25].); the 3D coordinates of ${}^{m}P_j$ are determined on the 3D digital model of the robot
arm. Therefore, after removing the scale factor s j , the unknowns in the equation system (4)
are only the rotation angles of the joints. The LM (Levenberg–Marquardt) algorithm [26] can be used to optimize the equation system to obtain the joint angles $\theta_1, \theta_2, \cdots, \theta_m$. The initial values of $\theta_1, \theta_2, \cdots, \theta_m$ are randomly assigned within their effective ranges.
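A minimal sketch of this optimization with SciPy's Levenberg–Marquardt solver is given below. It is an illustration of Formula (4), not the authors' implementation: fk_transform (the matrix of Equation (2)) and keypoints_local (the homogeneous coordinates of the preset keypoints) are assumed helper objects.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(theta, observations, fk_transform, keypoints_local):
    """Reprojection residuals of Formula (4). `observations` holds, per kept detection,
    the camera matrices K, R, t, the keypoint index j, its arm segment m and the
    detected pixel position. `fk_transform(theta, m)` is the 4x4 matrix of Eq. (2)."""
    res = []
    for K, R, t, j, m, uv in observations:
        P0 = fk_transform(theta, m) @ keypoints_local[j]   # homogeneous 3D point in C0
        p = K @ (R @ P0[:3] + t)                           # project into the camera
        res.extend((p[:2] / p[2]) - uv)                    # remove the scale factor s_j
    return np.asarray(res)

def solve_joint_angles(observations, fk_transform, keypoints_local, theta0):
    """Levenberg-Marquardt optimization of the joint angles from multi-view detections."""
    sol = least_squares(residuals, theta0, method="lm",
                        args=(observations, fk_transform, keypoints_local))
    return sol.x
```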

2.3.2. Keypoint Screening Based on Detection Confidence

Some keypoints, especially those on the first segment or on the flange of the robot arm, are prone to be occluded by other arm segments in certain perspectives, as shown in Figure 9. When a keypoint is blocked in the image, the detection reliability of the neural network will decline, and the error between the predicted position and the real position will be larger (see Section 3 for the experimental results). The accuracy decline of the keypoint detection will inevitably increase the attitude estimation error.

Figure 9. Three examples of robot arm with self-occlusion in real images.

However, in the case of monitoring with multi-view images, a keypoint is not likely to be occluded in all the images. Therefore, we propose a keypoint screening scheme, which is based on the detection confidence of the keypoint, to improve the attitude estimation accuracy.

As mentioned above, the value of each pixel in the heatmap output by the SRKDNet represents the probability that the image of the keypoint is located on that pixel. The pixel with the largest probability value (i.e., the detection confidence) in the heatmap will be selected as the detection result. For the L images from different perspectives, each keypoint will have L detection results, whose confidence values are different. The L detection results are sorted from high to low according to their confidence values. Then, the results with low confidence scores are discarded, and at least two results with the highest scores are kept. The screened results with high detection quality are substituted into Formula (4) so that the

attitude of the robot arm can be solved more accurately and reliably. It should be noted
that with this screening scheme, the number of equations in (4) will be less than L × J, but
still far more than the number of unknowns. Therefore, it can ensure robust solutions.
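A sketch of this screening rule for a single keypoint is given below; the confidence threshold is an assumed parameter, since the paper only states that low-confidence detections are discarded and that at least the two highest-scoring views are kept.

```python
def screen_detections(detections, keep_at_least=2, min_confidence=0.5):
    """Keep, for one keypoint, the most confident of its L single-view detections.
    `detections` is a list of (camera_index, pixel_position, confidence) triples."""
    ranked = sorted(detections, key=lambda d: d[2], reverse=True)
    kept = [d for d in ranked if d[2] >= min_confidence]   # threshold is an assumption
    if len(kept) < keep_at_least:                           # always keep at least two views
        kept = ranked[:keep_at_least]
    return kept
```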

3. Experiments
3.1. Experiments on Virtual Data
The virtual sample acquisition and labeling methods described in Section 2.2 were
used to generate 11,000 labeled sample images with a resolution of 640 × 640. We randomly
selected 9000 virtual samples as the training set and 1000 virtual samples as the validation
set. The validation set was not included in the training and was only used to verify the
effect of the model after each round of training. The other 1000 virtual samples served
as the test set to demonstrate the final effect of the model after all rounds of training. All
the sample images in this study were monochrome. Before the sample images were put
into the convolutional neural network for training, they were reduced to the resolution of
256 × 256. All the experiments in this study were performed on a Dell workstation with an
RTX2080S graphics card and 8 GB video memory.

3.1.1. Loss Function and Model Training Settings


Denote the labels in a sample batch as {Gi (i = 1, 2, · · · , N )} and the heatmap output
by the convolutional neural network as {Hi (i = 1, 2, · · · , N )}, where N represents the
sample number in a batch during the training process. The number of channels, height and
width of the true heatmap corresponding to a sample image are the same as those of the
predicted heatmap, denoted as C, H and W, respectively. The number of channels C equals
the keypoint number, i.e., 20, in this paper.
The mean square error was used as the loss function:

$$ \mathrm{MSE} = \frac{1}{N}\frac{1}{C}\frac{1}{H}\frac{1}{W}\sum_{i=0}^{N-1}\sum_{j=0}^{C-1}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\big[\mathbf{G}_i(j,h,w) - \mathbf{H}_i(j,h,w)\big]^2 \qquad (5) $$
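In PyTorch, Formula (5) corresponds directly to the built-in mean-squared-error loss; the tensor sizes below are only an example.

```python
import torch
import torch.nn as nn

# Formula (5) is the element-wise mean squared error over the batch of predicted
# and ground-truth heatmaps; it reduces to nn.MSELoss in PyTorch.
criterion = nn.MSELoss()                 # averages over N x C x H x W elements
pred = torch.rand(8, 20, 256, 256)       # a batch of predicted heatmaps (example sizes)
target = torch.rand(8, 20, 256, 256)     # the corresponding ground-truth heatmaps
loss = criterion(pred, target)
```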

SHNet [8], HRNet [11] and the proposed SRKDNet were trained with the generated
virtual data for the comparison of the keypoint detection performance among these models.
The PyTorch library was used to build and train the models. In the training of HRNet
and the proposed SRKDNet, the settings in Ref. [11] were adopted: Adam optimizer was
used; the initial learning rate was set to 0.001; the total training epoch was 45; the data
batch size was 8; the learning rate was reduced once every 15 rounds with a reduction
factor of 0.1. The weights of HRNet were obtained from the pre-trained HRNet on the
ImageNet [27] dataset. For the backbone network of the proposed SRKDNet, the same
initial weights and number of intermediate layers as in Ref. [11] were adopted. For SHNet,
two hourglass modules were stacked, and its training followed the settings in Ref. [8]: the
Rmsprop optimizer was used, the learning rate was initially set to 0.00025 and the neural
network was trained from scratch using Pytorch’s default weight initialization.
The resolution of the heatmaps output by both SHNet and HRNet was 64 × 64. The
standard deviation of the Gaussian distribution of the weights on the corresponding ground-
truth heatmap was set to 1 pixel. The resolution of the heatmaps output by SRKDNet
was 256 × 256, and the standard deviation of the Gaussian distribution of the weights on
the corresponding ground-truth heatmap was set to 3 pixels. The channel number of the
highest-resolution branch of HRNet and SRKDNet was 32. The channel number of the
feature maps in the supplementary branch containing shallow image features in SRKDNet
was set to 8.
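The settings listed above can be collected into the following PyTorch training skeleton; the model and data loader are placeholders, and the remaining values mirror the stated hyper-parameters (Adam, initial learning rate 0.001, 45 epochs, batch size 8, learning-rate decay by a factor of 0.1 every 15 epochs).

```python
import torch

def train(model, train_loader, epochs=45, device="cuda"):
    """Training skeleton for HRNet / SRKDNet with the hyper-parameters of Section 3.1.1."""
    model.to(device)
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)
    for epoch in range(epochs):
        for images, labels in train_loader:   # batch size 8 in the paper
            optimizer.zero_grad()
            loss = criterion(model(images.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step()
```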

3.1.2. Experimental Results on Full-View SRKDNet


A commonly used metric, namely the percentage of correct keypoints (PCK) [18], was
adopted to evaluate the keypoint prediction accuracy of each model. It is defined as shown
in Formula (6):

$$ \mathrm{PCK} = \frac{1}{A}\sum_{i=1}^{A}\delta\!\left(\frac{e_i}{e_{norm}}\right), \qquad \delta(x) = \begin{cases} 1, & x \le \tau \\ 0, & x > \tau \end{cases} \qquad (6) $$

where A is the total number of predicted results; ei is the pixel distance between the
predicted and the ground-truth positions; enorm is the standard error distance; τ is a specified
threshold to adjust the ratio between the calculated error distance in the experiments and
enorm . If the calculated distance error ei between the predicted and the true positions of
the keypoint is less than enorm × τ , δ equals 1 and the predicted position is considered
correct. The keypoint prediction result of our full-view SRKDNet will be compared with
that of SHNet and HRNet by using PCK as the metric. In our experiment, enorm was
set to 40 pixels and τ was assigned as 0.2, 0.15 or 0.1. Considering that the three neural
networks output heatmaps with different resolutions, but the detected keypoint positions
need to be mapped back to the original sample images to conduct the subsequent robot arm
attitude estimation, we mapped the predicted coordinates of all keypoints to the original
resolution 640 × 640 for comparison. Table 1 lists the PCK values of the three methods,
where PCK@0.2, PCK@0.15 and PCK@0.1 represent the prediction accuracy with τ = 0.2,
τ = 0.15 and τ = 0.1, respectively.
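The PCK metric of Formula (6) reduces to a thresholded distance count; a minimal sketch, assuming the predicted and ground-truth keypoint coordinates are arrays of shape (A, 2) in the 640 × 640 image space:

    import numpy as np

    def pck(pred_xy, gt_xy, e_norm=40.0, tau=0.2):
        # e_i: pixel distance between predicted and ground-truth positions.
        errors = np.linalg.norm(np.asarray(pred_xy) - np.asarray(gt_xy), axis=1)
        # A prediction counts as correct if e_i <= e_norm * tau.
        return float(np.mean(errors <= e_norm * tau))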
Table 1. Experimental results of keypoint detection on virtual samples.

Methods                PCK@0.2    PCK@0.15   PCK@0.1
SHNet                  93.17%     87.27%     66.06%
HRNet                  94.91%     88.79%     68.14%
full-view SRKDNet      96.23%     94.33%     89.07%
The results in Table 1 show that the trained full-view SRKDNet completely outperforms
the two comparison models SHNet and HRNet under all three threshold values. The
smaller the threshold is, the more obvious the superiority of the full-view SRKDNet over
the two comparison models is. The reasons for the superiority may lie in two aspects:
(1) Using the heatmaps with a higher resolution (256 × 256) in the training labels can
reduce the negative influence of the downsampling operation. (2) The predictive heatmap
with the trained super-resolution layer can express the detected keypoints more accurately.
Figure 10 shows the detection results of the three keypoint detection networks SHNet,
HRNet and our full-view SRKDNet for the same test image. The green dots are the real
locations of the keypoints, and the blue dots are the predicted locations. The mean error
refers to the average of the pixel distances between the predicted locations and the real
locations of all the keypoints. The mean prediction error of the full-view SRKDNet is
significantly lower than that of SHNet and HRNet. We can also intuitively see that most
keypoint locations predicted by the full-view SRKDNet are closer to the real locations than
those predicted by the comparison methods.
Figure 10. Comparison of keypoint detection results: (a) SHNet, mean error 3.22 pixels; (b) HRNet, mean error 2.92 pixels; (c) full-view SRKDNet, mean error 1.48 pixels.

3.1.3. Experiment on Occlusion Effect


A large number of experiments have shown that the detection confidence of the
occluded keypoints in the image is low. In the example shown in Figure 11, keypoint
No. 1 on the end working unit is completely blocked by other arm segments. It turns out
that the maximum confidence of this keypoint in the heatmap is only 0.116. Meanwhile,
keypoint No. 2 on the first arm segment is visible in the image, and its detection confidence
is 0.774, which is much higher than that of keypoint No. 1. It can also be clearly observed
from Figure 11 that the predicted position of keypoint No. 1 deviates a lot from the real
position, while the predicted position of keypoint No. 2 is closely coincident with the real
position. The verified negative influence of self-occlusion on keypoint detection motivated
the screening scheme proposed in Section 2.3.2. Experiments on the screening scheme will
be reported in Section 3.1.5.
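In these experiments, the confidence of a keypoint is taken as the maximum value of its predicted heatmap channel, and the predicted position is the location of that maximum. A minimal decoding sketch (array names are illustrative):

    import numpy as np

    def decode_keypoint(heatmap_channel):
        # heatmap_channel: 2D array (H, W) predicted for one keypoint.
        y, x = np.unravel_index(np.argmax(heatmap_channel), heatmap_channel.shape)
        confidence = float(heatmap_channel[y, x])   # e.g., 0.116 for occluded keypoint No. 1
        return (x, y), confidence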

Figure 11. The influence of self-occlusion on keypoint detection.

The GPU (graphics processing unit) memory occupation of the full-view SRKDNet
was also compared with that of SHNet and HRNet with the batch size set to 8 in the
training. The result is shown in Table 2. The output heatmap resolution of the three
convolutional neural networks is shown in parentheses.

Table 2. Comparison of GPU memory occupation (batch size = 8).

Neural Network Model               GPU Occupation
SHNet (64 × 64)                    3397 MB
HRNet (64 × 64)                    3031 MB
full-view SRKDNet (256 × 256)      3717 MB

Table 2 shows that HRNet occupies the least GPU memory during the training and
outputs heatmaps with a resolution of only 64 × 64. The SRKDNet occupies 22.6% more
GPU memory resources than HRNet. This demonstrates that the proposed full-view
SRKDNet can remarkably improve the detection accuracy (see Table 1) at the expense of
a mild increase in GPU occupation.
For further comparison, we canceled the downsampling operations in the preprocessing
stage of HRNet so that HRNet can also output heatmaps with a 256 × 256 resolution,
which is the same as that of the full-view SRKDNet. However, the maximum batch size
of HRNet can only be set to 2 in this situation, and the experimental results are shown
in Table 3. These results demonstrate that to enable HRNet to output heatmaps with the
same resolution as that of the full-view SRKDNet, the GPU memory resource consumption
will increase sharply. When batch size = 2, the GPU occupation of HRNet is 277.03% higher
than that of our full-view SRKDNet.

Table 3. Comparison of GPU memory occupation (batch size = 2).

Neural Network Model GPU Occupation


HRNet (256 × 256) 7371 MB
full-view SRKDNet (256 × 256) 1955 MB

3.1.4. Experimental Results on Dual SRKDNets


In this section, we present the results of keypoint detection conducted with dual SRKD-
Nets to verify the effect of the coarse-to-fine detection strategy. Based on the detection
results of the full-view SRKDNet in Section 3.1.2 for each virtual sample image, the corre-
sponding region of interest (ROI, i.e., the region within the bounding box of the detected
keypoints) of the robot arm in each image was determined. The close-up SRKDNet was
then trained using the clipped local ROI images.
In the experiments on dual SRKDNets, for any virtual test samples, the trained full-
view SRKDNet was first used to conduct initial keypoint detection. Then, the ROI image
was input to the trained close-up SRKDNet to achieve the final detection results. The
threshold τ was set to 0.15, 0.1 and 0.05, which were more stringent in order to adapt to
the detection accuracy increase. The other settings were consistent with the experiments in
Section 3.1.2.
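The coarse-to-fine procedure can be outlined as in the following sketch, where full_view_net and close_up_net stand for the two trained detectors and are assumed to return decoded keypoint coordinates in the pixel space of their input image; the ROI margin value is an illustrative assumption:

    import numpy as np

    def detect_coarse_to_fine(image, full_view_net, close_up_net, margin=20):
        # Stage 1: coarse detection on the full view.
        coarse_kpts = np.asarray(full_view_net(image))            # (K, 2) pixel coordinates
        x0, y0 = np.floor(coarse_kpts.min(axis=0)).astype(int) - margin
        x1, y1 = np.ceil(coarse_kpts.max(axis=0)).astype(int) + margin
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, image.shape[1]), min(y1, image.shape[0])
        # Stage 2: refined detection on the clipped ROI image.
        roi = image[y0:y1, x0:x1]
        fine_kpts = np.asarray(close_up_net(roi))
        # Map the refined keypoints back to full-image coordinates.
        return fine_kpts + np.array([x0, y0])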
The keypoint detection results of the full-view SRKDNet and dual SRKDNets are
shown in Table 4. In the “full-view SRKDNet” method, only the trained full-view SRKDNet
was utilized for the keypoint detection. In the “dual SRKDNet” method, the trained
close-up SRKDNet was used following the full-view SRKDNet.

Table 4. Comparison of full-view SRKDNet only and dual SRKDNets.

Method                 PCK@0.15   PCK@0.1    PCK@0.05
full-view SRKDNet      94.33%     89.07%     62.14%
dual SRKDNet           98.69%     97.66%     93.92%

Table 4 shows that the PCK score of dual SRKDNets is higher than that of the full-view
SRKDNet, which means that the use of the close-up SRKDNet can effectively improve
the keypoint detection accuracy. When a more stringent threshold is set, a more obvious
improvement can be achieved. When τ = 0.05, in other words, when the distance threshold
between the detected and the real keypoint positions was set to 2 pixels, the keypoint
detection accuracy increased from 62.14% to 93.92%. When the threshold was assigned as
0.1, the PCK score of the close-up SRKDNet increased to 97.66%, compared to 89.07% of
the full-view SRKDNet.
The above experimental results demonstrate that the proposed successive working
mechanism of the dual SRKDNets is quite effective. The close-up SRKDNet can further
improve the keypoint detection accuracy by a large margin.

3.1.5. Robot Arm Attitude Estimation Experiments


We used the dual SRKDNets and virtual multi-view images to verify the effect of
robot arm attitude estimation. Specifically, four cameras were arranged in the UE4 virtual
environment, and 1000 sets of four-perspective sample images were collected using the
method in Section 2.2. The setting values of the six joint rotation angles corresponding to
each of the 1000 sets of images were recorded as the ground-truth values of the 1000 attitude
estimation experiments. The full-view SRKDNet trained as described in Section 3.1.2 and
the close-up SRKDNet trained as described in Section 3.1.4 were used for the coarse-to-fine
keypoint detection.
The comparison experiments of single-view and multi-view attitude estimation, as
well as the comparison of using and not using the confidence-based keypoint screening
scheme, were conducted. The specific keypoint screening method for the four-perspective
sample images adopted in the attitude estimation experiments was as follows: the detected
keypoints with the top three highest confidence values were kept. If the fourth detection
result had a confidence score greater than 0.9, it would also be retained; otherwise, it would
be discarded.
The average error of the estimated rotation angles of the 1000 experiments of each
joint is shown in Table 5. “Single view” means the attitude estimation was performed
based on the information from one single-perspective image (we randomly selected the
1000 sample images collected by the second camera); “four views” means that images
from all four perspectives were used in the attitude estimation; “four views + confidence
screening” means the multi-view keypoint screening scheme was utilized on the basis of
“four views”.

Table 5. Robot arm attitude estimation errors based on virtual images (unit: degree).

Method                               Joint-1   Joint-2   Joint-3   Joint-4   Joint-5   Joint-6   Average
Single view                          0.24      0.42      1.33      5.55      6.72      12.91     4.53
Four views                           0.12      0.17      0.30      1.41      1.42      2.95      1.06
Four views + confidence screening    0.05      0.06      0.13      0.70      0.73      1.72      0.57

The above experimental results demonstrate that the average estimation error of the
joint angles using images from four perspectives was reduced by 76.60% compared with
that using the information from one perspective only. The confidence-based keypoint
screening scheme further reduced the average error of the four-view attitude estimation
by 46.23%. The compound accuracy increase reaches nearly an order of magnitude, which
proves that the whole set of methods proposed in this paper is very effective.

3.2. Experiments on Real Robot Arm
3.2.1. Real Data Acquisition

The scene of a real robot arm attitude estimation experiment is shown in Figure 12,
in which three cameras are distributed around a UR10 robot arm. The intrinsic parameters
of the cameras and the transformation matrix of each camera coordinate system relative
to the base coordinate system of the robot arm were calibrated in advance by using
well-studied methods [23–25].


Figure 12. Experiment scene of real robot arm attitude estimation.

We planned 648 positions for the flange endpoint of the robot arm in its working space
as the sample positions, as shown in Figure 13. Each sample position corresponded to a
set of six joint angles. After the robot arm reached each sample position, the three cameras
collected images synchronously and automatically recorded the current joint angles and
the 3D coordinates of the center endpoint of the flange in the base coordinate system. This
process was repeated until the real sample data collection was completed. A total of 1944
images were captured by the three real cameras.

Figure 13. Sample planning of real robot arm in working space.

The resolution of the industrial camera used in the experiment was 5120 × 5120. To
facilitate the training and prediction of the keypoint detection networks, and to unify the
experimental standards, the resolution of the collected real images was reduced to
640 × 640, which was the same as the resolution of the virtually synthesized images. The
detected keypoint positions were mapped back to the initial images for the robot arm
attitude estimation.
3.2.2. Keypoint Detection Experiment on Real Robot Arm
The real UR10 robot arm is consistent with the digital model in the virtual sampling
platform. Therefore, all the settings in the experiments in Section 3.1, the geometric
parameters of the robot arm, the 3D coordinates of the keypoints in the arm segment
coordinate system and the kinematic model of the robot arm were also applied to the real
robot arm attitude estimation experiments.
When the full-view SRKDNet trained using the virtual samples as described in
Section 3.1.2 was used to detect the keypoints in the real images, its detection accuracy on
real images was only 34.11%, 20.86% and 7.06% when the threshold τ was set to 0.15, 0.1
and 0.05, respectively. Therefore, we considered using the real sample data to fine-tune the
trained model.
From the 1944 images (648 sets of triple-view real data) obtained in Section 3.2.1, 99 real
sample data (33 sets) were randomly selected as training sets. Another randomly selected
99 real sample data (33 sets) served as the validation sets. The remaining 1746 samples
(582 sets) were used as the test sets to evaluate the performance of keypoint detection and
attitude estimation.
The full-view SRKDNet and the close-up SRKDNet pre-trained with virtual sample
data were both fine-tuned with the training sets. For comparison, we also tried the method
in which the full-view SRKDNet and the close-up SRKDNet were trained not with virtual
sample data but directly with the real training sets. Since the number of real samples
used for training was very small, the training epoch was set to 300. The learning rate was
reduced once for every 100 epochs with a reduction coefficient of 0.1. The other settings
were consistent with those in Section 3.1.1. The keypoint detection results of the 1746
real test samples with the model trained with these methods are shown in Table 6. The
comparison of all the experimental results was still evaluated at a resolution of 640 × 640.
The first row displays the result of applying only the full-view SRKDNet trained with
the 99 real sample data (training sets) to detect the keypoints. The second row displays
the result of applying only the full-view SRKDNet trained with our virtual datasets and
fine-tuned with the 99 real sample data (training sets) to detect the keypoints. The keypoint
detection results of dual SRKDNets trained with the 99 real sample data (training sets) are
shown in the third row. The last row displays the result of dual SRKDNets trained with
our virtual datasets and fine-tuned with the 99 real sample data (training sets). The initial
weights of the backbone network of these models were obtained from HRNet pre-trained
with the ImageNet dataset.
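A sketch of the fine-tuning setup described above, assuming the weights pre-trained on virtual samples are stored in a checkpoint file (the file name is illustrative):

    import torch
    import torch.nn.functional as F

    def finetune_on_real_samples(model, real_loader, checkpoint="srkdnet_virtual.pth", epochs=300):
        # Start from the virtual-data-pretrained weights and fine-tune on the
        # 99 real training samples; lr reduced by a factor of 0.1 every 100 epochs.
        model.load_state_dict(torch.load(checkpoint))
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)
        for _ in range(epochs):
            for images, gt_heatmaps in real_loader:
                optimizer.zero_grad()
                loss = F.mse_loss(model(images), gt_heatmaps)
                loss.backward()
                optimizer.step()
            scheduler.step()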

Table 6. Experimental results of keypoint detection for real robot arm.

Method               Virtual Sample Data Used in Training    PCK@0.15    PCK@0.1    PCK@0.05
Full-view SRKDNet    No                                      92.01%      83.38%     47.27%
Full-view SRKDNet    Yes                                     95.35%      88.48%     56.26%
Dual SRKDNets        No                                      91.68%      85.72%     66.50%
Dual SRKDNets        Yes                                     96.07%      92.51%     78.07%

Table 6 shows that no matter whether full-view SRKDNet only or dual SRKDNets
were used for the keypoint detection, the model trained with the virtual samples generated
in Section 3.1.2 and fine-tuned with 99 real sample data (training sets) demonstrates much
better detection accuracy on the real test sets. When the full-view SRKDNet and close-up
SRKDNet were trained with no virtual data but only the 99 real sample data (training sets),
the keypoint detection accuracy values on the real test images were obviously lower. The
results of this experiment verify the following: (1) the virtual samples generated with the
proposed method in Section 2.2 have a significant positive effect on the keypoint detection
of the robot arm in real scenes; (2) small amounts of real samples can efficiently re-train the
model having been trained with virtual samples and achieve high generalization on real
robot arms.
An example of keypoint detection in a realistic scenario is shown in Figure 14, where
the first row shows the situation of using the full-view SRKDNet only, and the second row
shows the situation of using the full-view SRKDNet and the close-up SRKDNet. The first
column in Figure 14 shows the input real images, the second column shows the achieved
heatmaps and the third column illustrates the keypoint detection results.

Figure 14. Keypoint detection in realistic scenario.

3.2.3. Attitude Estimation Experiment on Real Robot Arm


In this section, we report the real robot arm attitude estimation experiment, which was
carried out based on the keypoint detection method with the highest detection accuracy
in Section 3.2.2, i.e., the last method in Table 6. The confidence-based keypoint screening
scheme was employed in the triple-view attitude estimation process. For each keypoint,
the results with the first and the second highest confidence scores were retained. If the
confidence score of the remaining result was larger than 0.9, it was also retained; otherwise, it
was discarded. The distortion compensation was conducted according to the calibrated
camera parameters to rectify the detected keypoint position.
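The distortion compensation of the detected keypoints can be performed with a standard undistortion routine; a sketch with OpenCV, assuming the calibrated intrinsic matrix K and distortion coefficients dist of each camera are available:

    import cv2
    import numpy as np

    def undistort_keypoints(keypoints_xy, K, dist):
        # keypoints_xy: (N, 2) detected pixel positions in the distorted image.
        pts = np.asarray(keypoints_xy, dtype=np.float64).reshape(-1, 1, 2)
        # Passing P=K maps the normalized undistorted points back to pixel coordinates.
        rectified = cv2.undistortPoints(pts, K, dist, P=K)
        return rectified.reshape(-1, 2)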
The average estimation errors of each joint angle on the 582 real test sets collected from
three perspectives are shown in Table 7. The ground-truth values of each joint angle were
obtained from the control center of the robot arm when the sample data were collected.
The error of each joint in the table is the average of 582 estimated results.

Table 7. Experimental results of real robot arm attitude estimation (unit: degree).

Joint No.        Joint-1   Joint-2   Joint-3   Joint-4   Joint-5   Joint-6   Average
Average error    0.15      0.10      0.15      0.55      0.78      1.47      0.53

The experimental results in Table 7 show that the whole set of methods proposed in
this paper can achieve high-precision attitude estimation for the robot arm under realistic
scenarios. The total average error of all of the six joints is only 0.53◦ , which is even slightly
better than the average error of 0.57◦ in the attitude estimation for the virtual test samples
in Table 5. Here, we briefly analyze the reasons. When sampling in the virtual environment,
each joint angle and the camera poses were set at random. Therefore, a certain number of
virtual samples are unrealistic, for example, with interference between arm segments or
with arm segments positioned so close together that some keypoints are self-occluded in
all perspectives. It is hard to estimate the attitude of the robot arm with these
virtual samples. However, in the real scenario, only the normal working attitudes appear
in the sample data, with no extremely strange attitudes for the sake of safety. This may
explain the reason why the attitude estimation accuracy using three-view information in
real scenes is slightly better than that using four-view information in a virtual dataset.
The average joint angle estimation errors of a four-joint robot arm reported in Ref. [17]
and Ref. [18] are 4.81 degrees and 5.49 degrees, respectively, while the average error of the
estimated joint angle of the six-joint UR10 robot arm with our method is only 0.53 degrees,
which is significantly lower. The reasons may lie in three aspects: (1) The SRKDNet
proposed in this paper learns the heatmaps with higher resolution by adding a subpixel
convolutional layer. In addition, the combination detection scheme based on the full-view
and close-up dual SRKDNets significantly improves the detection accuracy of the keypoints.
(2) The existing methods only use images from one single view, while our method uses
multi-view images, which can effectively alleviate the problems of self-occlusion and depth
ambiguity exhibited in single-view images. Moreover, the negative influence of improper
keypoint detection results can be greatly reduced by using the information redundancy
of the multi-view images and the confidence-based keypoint screening scheme. (3) The
existing methods not only estimate the joint angles of the robot arm but also estimate the
position and attitude of the robot arm relative to the camera, so there are 10 unknowns to
be estimated. Considering that the base of the robot arm is usually fixed in the industrial
scenes, we determine the relative geometry relationship between the robot arm and the
cameras through a well-established off-line calibration process to simplify the problem
to six unknowns. The above aspects of our method together contribute to the significant
accuracy improvements.
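For readers unfamiliar with subpixel convolution, the idea of the super-resolution head (cf. Ref. [15]) can be illustrated by the following simplified stand-in; it is not the exact SRKDNet architecture, and the channel numbers are assumptions consistent with the settings in Section 3.1.1:

    import torch.nn as nn

    class SubpixelUpsampleHead(nn.Module):
        # Turns low-resolution feature maps into heatmaps upsampled by a factor of 4
        # (e.g., 64 x 64 -> 256 x 256) via a convolution followed by PixelShuffle.
        def __init__(self, in_channels=32, num_keypoints=20, scale=4):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, num_keypoints * scale * scale,
                                  kernel_size=3, padding=1)
            self.shuffle = nn.PixelShuffle(scale)

        def forward(self, x):
            return self.shuffle(self.conv(x))   # (N, num_keypoints, 4H, 4W)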
We also counted the time consumed in the two stages of keypoint detection and at-
titude estimation in the real scene, as shown in Table 8. The total time used for keypoint
detection in the three-perspective images using the dual SRKDNets was 0.28 s. The reso-
lution of the sample images used for the keypoint detection here was still 640 × 640. The
time required for solving the joint angles of the real robot arm was 0.09 s.

Table 8. Time consumption of each stage (unit: second).

Stage               Triple-View Keypoint Detection    Robot Arm Attitude Estimation
Time consumption    0.28                              0.09

4. Conclusions
We have proposed a set of methods for accurately estimating the robot arm attitude
based on multi-view images. By incorporating a subpixel convolution layer into the back-
bone neural network, we put forward the SRKDNet to output high-resolution heatmaps
without significantly increasing the computational resource consumption. A virtual sample
generation platform and a keypoint detection mechanism based on dual SRKDNets were
proposed to improve the keypoint detection accuracy. The keypoint prediction accuracy
for the real robot arm is up to 96.07% for PCK@0.15 (i.e., the position deviation between
the predicted and the real keypoints is within 6 pixels). An equation system, involving the
camera imaging model, the robot arm kinematic model, and the keypoints detected with
confidence values, was established and solved to finally obtain the rotation angles of the
joints. The confidence-based keypoint screening scheme makes full use of the information
redundancy of the multi-view images and is proven to be effective in ensuring attitude
estimation. Plenty of experiments on virtual and real robot arm samples were conducted,
and the results show that the proposed method can significantly improve the robot arm
attitude estimation accuracy. The average estimation error of the joint angles of the real
six-joint UR10 robot arm under three views is as low as 0.53 degrees, which is much lower
than that of the comparison methods. The entire proposed method is more suitable for
industrial applications with high precision requirements for robot arm attitude estimation.
In the real triple-view monitoring scenario, a total of 0.37 s was required for the
keypoint detection stage and the attitude-solving stage. The keypoint detection accounted
for most of the time, because our method needs to detect keypoints in multi-view images
with dual SRKDNets. Therefore, the efficiency of the proposed method is
lower than that of the single-view-based method.
In this study, we only conducted experiments on one UR10 robot arm. In the future, we
will try to extend our method to real industrial scenes with more types of robot arms.

Author Contributions: The work described in this article is the collaborative development of all
authors. Conceptualization, L.Z. (Liyan Zhang); methodology, L.Z. (Ling Zhou) and R.W.; software,
Sensors 2024, 24, 305 20 of 21

R.W.; validation, L.Z. (Ling Zhou) and R.W.; formal analysis, L.Z. (Ling Zhou); investigation, L.Z.
(Ling Zhou) and R.W.; resources, L.Z. (Liyan Zhang); data curation, R.W.; writing—original draft
preparation, L.Z. (Ling Zhou); writing—review and editing, R.W. and L.Z. (Liyan Zhang); visual-
ization, R.W. and L.Z. (Ling Zhou); supervision, L.Z. (Liyan Zhang); project administration, L.Z. (Liyan
Zhang); funding acquisition, L.Z. (Liyan Zhang). All authors have read and agreed to the published
version of the manuscript.
Funding: This research was funded by the National Science Foundation of China (Grant number
52075260) and the Key Research and Development Program of Jiangsu Province, China (Grant number
BE2023086).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data are available from the corresponding author on reasonable
request.
Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Lin, L.; Yang, Y.; Song, Y.; Nemec, B.; Ude, A.; Rytz, J.A.; Buch, A.G.; Krüger, N.; Savarimuthu, T.R. Peg-in-Hole assembly under
uncertain pose estimation. In Proceedings of the 11th World Congress on Intelligent Control and Automation, Shenyang, China,
29 June–4 July 2014; pp. 2842–2847.
2. Smys, S.; Ranganathan, G. Robot assisted sensing, control and manufacture in automobile industry. J. ISMAC 2019, 1, 180–187.
3. Bu, L.; Chen, C.; Hu, G.; Sugirbay, A.; Sun, H.; Chen, J. Design and evaluation of a robotic apple harvester using optimized
picking patterns. Comput. Electron. Agric. 2022, 198, 107092. [CrossRef]
4. Lu, G.; Li, Y.; Jin, S.; Zheng, Y.; Chen, W.; Zheng, X. A realtime motion capture framework for synchronized neural decoding. In
Proceedings of the 2011 IEEE International Symposium on VR Innovation, Singapore, 19–20 March 2011. [CrossRef]
5. Verma, A.; Kofman, J.; Wu, X. Application of Markerless Image-Based Arm Tracking to Robot-Manipulator Teleoperation. In
Proceedings of the 2004 First Canadian Conference on Computer and Robot Vision, 2004, Proceedings, London, ON, Canada,
17–19 May 2004; pp. 201–208.
6. Liang, C.J.; Lundeen, K.M.; McGee, W.; Menassa, C.C.; Lee, S.; Kamat, V.R. A vision-based marker-less pose estimation system for
articulated construction robots. Autom. Constr. 2019, 104, 80–94. [CrossRef]
7. Toshev, A.; Szegedy, C. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the 2014 IEEE Conference
on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660.
8. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference
on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 483–499.
9. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In
Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June
2018; pp. 7103–7112.
10. Peng, S.; Liu, Y.; Huang, Q.; Zhou, X.; Bao, H. PVNet: Pixel-wise voting network for 6DoF pose estimation. In Proceedings of the
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4556–4565.
11. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5686–5696.
12. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach.
Intell. 2016, 38, 295–307. [CrossRef] [PubMed]
13. Kim, J.; Lee, J.; Lee, K. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the 2016 IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654.
14. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings
of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017;
pp. 1132–1140.
15. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video
super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
16. Widmaier, F.; Kappler, D.; Schaal, S.; Bohg, J. Robot arm pose estimation by pixel-wise regression of joint angles. In Proceedings
of the 2016 IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; pp. 616–623.
17. Labbé, Y.; Carpentier, J.; Aubry, M.; Sivic, J. Single-view robot pose and joint angle estimation via render & compare. In
Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June
2021; pp. 1654–1663.
Sensors 2024, 24, 305 21 of 21

18. Zuo, Y.; Qiu, W.; Xie, L.; Zhong, F.; Wang, Y.; Yuille, A.L. CRAVES: Controlling robotic arm with a vision-based economic system.
In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20
June 2019; pp. 4209–4218.
19. Liu, Q.; Yang, D.; Hao, W.; Wei, Y. Research on Kinematic Modeling and Analysis Methods of UR Robot. In Proceedings of the
2018 IEEE 4th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 14–16 December
2018; pp. 159–164.
20. Sanders, A. An Introduction to Unreal Engine 4; CRC Press: Boca Raton, FL, USA; Taylor & Francis Group: Abingdon, UK, 2017.
21. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in
context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
22. Qiu, W.; Zhong, F.; Zhang, Y.; Qiao, S.; Xiao, Z.; Kim, T.S.; Wang, Y. UnrealCV: Virtual worlds for computer vision. In Proceedings
of the 2017 ACM, Tacoma, WA, USA, 18–20 August 2017; pp. 1221–1224.
23. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [CrossRef]
24. Strobl, K.; Hirzinger, G. Optimal hand-eye calibration. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent
Robots and Systems, Beijing, China, 9–13 October 2006; pp. 4647–4653.
25. Park, F.; Martin, B. Robot sensor calibration: Solving AX=XB on the euclidean group. IEEE Trans. Robot. Autom. 1994, 10, 717–721.
[CrossRef]
26. Levenberg, K. A method for the solution of certain problems in least squares. Quart. Appl. Math. 1944, 2, 164–168. [CrossRef]
27. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf.
Process. Syst. 2012, 25, 1097–1105. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
