robotics

Article
Simulated and Real Robotic Reach, Grasp, and Pick-and-Place
Using Combined Reinforcement Learning and
Traditional Controls
Andrew Lobbezoo * and Hyock-Ju Kwon

AI for Manufacturing Laboratory, Department of Mechanical and Mechatronics Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
* Correspondence: [email protected]

Abstract: The majority of robots in factories today are operated with conventional control strategies
that require individual programming on a task-by-task basis, with no margin for error. As an
alternative to the rudimentary operation planning and task-programming techniques, machine
learning has shown significant promise for higher-level task planning, with the development of
reinforcement learning (RL)-based control strategies. This paper reviews the implementation of
combined traditional and RL control for simulated and real environments to validate the RL approach
for standard industrial tasks such as reach, grasp, and pick-and-place. The goal of this research is to
bring intelligence to robotic control so that robotic operations can be completed without precisely
defining the environment, constraints, and the action plan. The results from this approach provide
optimistic preliminary data on the application of RL to real-world robotics.

Keywords: reinforcement learning; proximal policy optimization; soft actor-critic; simulation environment; robot operating system; robotic control; Franka Panda robot; pick-and-place; real-world robotics

Citation: Lobbezoo, A.; Kwon, H.-J. Simulated and Real Robotic Reach, Grasp, and Pick-and-Place Using Combined Reinforcement Learning and Traditional Controls. Robotics 2023, 12, 12. https://doi.org/10.3390/robotics12010012

Academic Editors: Roman Mykhailyshyn and Ann Majewicz Fey

Received: 4 December 2022; Revised: 8 January 2023; Accepted: 9 January 2023; Published: 16 January 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Over 50 years ago, the first electrically powered robot, the Stanford Arm, was built [1,2]. Surprisingly, the mechanics, control strategies, and applications of the Stanford Arm are similar to modern robots, such as the Panda research robot (Franka Emika, 2018). Both robots are electronically actuated and are implemented with conventional control requiring precise instructions.

1.1. Project Motivation

The difficulty with conventional control is that it requires individual programming on a task-by-task basis with no margin for error. These strategies rely on experienced technicians or robotics engineers sending commands on graphical or text-based programming interfaces to perform sequences of simple tasks [1–4]. Reinforcement learning (RL)-based control strategies have shown potential for replacing this manual approach [5–13]. In RL, agents are presented with a task, which they learn to solve by exploring various action sequences on internal simulated models of the environment, or in the real world [14]. Compared to other RL applications such as self-driving cars and video games [14–16], robotic control is difficult due to the high dimensionality of the problem and the continuous space of actions [17–19].

1.2. State of Research

To date, RL has been successfully applied to robotics for basic tasks such as target object reaching, grasping, placement, and basic manipulation [8,20–22], which gives some indication of its potential as a method for controlling robotic agents [23]. However, there


is room for further exploration and research for tasks with long action sequences (pick-
and-place), there is a need for one-to-one comparisons between RL methods, and there
is an absence of real-world testing to validate the applicability of RL outside of simula-
tion. A detailed review on the current state of RL for robotic research can be found in
Lobbezoo et al. [24], Mohammed and Chua [25], Liu et al. [11], and Tai et al. [6].

1.3. Objective
The principal objective of this research is to explore the application of RL to sim-
ulated and real-world robotic agents to develop a method for replacing high-level task
programming. The objective has been broken down into the following subobjectives: (1) the
development of a pipeline for training robotic agents in simulation, (2) the training of vari-
ous models and the comparison of performance between each, and (3) testing of the RL
control system in the real world.

1.4. Contribution
The novel contributions of this research to the field can be summarized as follows.
1. We developed a novel pipeline for combining traditional control with RL to vali-
date the applicability of RL for end-to-end high-level robotic arm task planning and
trajectory generation.
2. We modified and tuned the hyperparameters and networks of two existing RL frame-
works to enable the completion of several standard robotics tasks without the use of
manual control.
3. We completed validation testing in the real world to confirm the feasibility and
potential of this approach for replacing manual task programming.
Other minor contributions include the following.
1. We created realistic simulation tasks for training and testing the application of RL for
robotic control.
2. We completed direct comparisons between proximal policy optimization (PPO) and soft actor–critic (SAC) to review the potential of
each for task learning.

2. Materials and Methods


2.1. Simulation Methodology
To complete the simulation objectives, a physics engine was selected, custom tasks
were designed, a codebase was implemented, and rewards were shaped according to
the tasks.

2.1.1. Physics Engine


Three common environments for robotic representations are Gazebo, MuJoCo, and
PyBullet as shown in Figure 1. Each package has strengths and weaknesses as evaluated
and compared below.
Because Gazebo [26] operates on top of a robot operating system (ROS), the com-
munication between the control system and the simulated robot perfectly replicates the
real-world communication. However, compared to MuJoCo and Pybullet, Gazebo has a
higher computational cost on the GPU. Due to the requirement of parallelization of agents
during training and GPU availability for network updates, Gazebo was rejected for this
research.
MuJoCo is an intensive physics engine with the highest solver stability, consistency
of results, accuracy of calculations, and energy conservation compared to other physics
environments [27]. Due to the licensing issues (until 18 October 2021 [28]), difficulties with
implementation, and the poor community support, MuJoCo was not selected for training.
PyBullet [29] is a Python-based environment, designed for rapid prototyping and testing of real-time physics. This environment is based on the Python Bullet Physics engine. PyBullet does not have prebuilt ROS communication; however, custom ROS nodes can be written to allow for ROS integration. Due to the ease of modification and implementation of the PyBullet environment, in addition to the simplicity of parallelization for training [29], PyBullet was selected as the primary environment for training the RL agent.

2.1.2. Framework and Custom Tasks

The simulation framework implemented for this project was the Gym API (0.19.0), developed by OpenAI Inc. The base PyBullet (3.2.1) Panda model was cloned from the GitHub repository created by Gallouedec et al. [30] and modified to suit this application.

The Gym-RL learning process is broken down into a series of episodes. The episodes are limited to 50–100 timesteps to ensure that the agent focuses its exploration in the vicinity of the target. During each timestep in the environment, the agent can move for 1/240 s in the simulation. For the robotic RL framework, each episode begins with the agent initialized in a standardized home position, with the target object instantiated in front of the agent at a random position. The agent must learn to relate the input state information from the environment to the ideal action command based on the episodic learning cycles. If the task is completed before the maximum number of steps is reached, the episode is terminated early and the reward per episode is improved.

To make custom PyBullet environments inside of the Gym-PyBullet model, several interacting features of the model were modified. The main modifications of the base environment included reward shaping, target object block instantiation (for pick-and-place), episodic termination, and modifications to friction coefficients and maximum joint forces.

The three custom tasks created for testing the RL robotic system are Panda reach, Panda grasp, and Panda pick-and-place. The task environments can be viewed in Figure 2. For each task, end effector-based control strategies with prebuilt IK packages were implemented. As the goal of this project was to learn high-level task planning strategies, end effector-based control (as an alternative to joint-based control) was adopted to reduce the difficulty in training.

Figure 1. Simulation environments. (a) Gazebo, (b) MuJoCo [31], (c) PyBullet.
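As an illustration of the episodic Gym loop described above, the sketch below creates a Panda task and steps through a few episodes. The environment ID "PandaReach-v2" is the one registered by the public panda-gym package and is an assumption here; the modified fork used in this work may register different IDs, and the random action stands in for the trained policy.

```python
import gym
import panda_gym  # registers the Panda tasks with the Gym API

env = gym.make("PandaReach-v2")

for episode in range(5):
    observation = env.reset()      # agent starts at the home position; target is randomized
    done = False
    episode_reward = 0.0
    while not done:                # episode ends on success or at the step limit
        action = env.action_space.sample()                   # stand-in for the RL policy output
        observation, reward, done, info = env.step(action)   # one 1/240 s simulation step
        episode_reward += reward
    print(f"Episode {episode}: total reward = {episode_reward:.2f}")

env.close()
```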
For these tasks, vector-based position feedback was applied. The agent was fed some combination of state positions, including the gripper (x, y, z, Vx, Vy, Vz, pitch, roll, yaw), the target object/objects (x, y, z, Vx, Vy, Vz, pitch, roll, yaw), and, in the case of pick-and-place, the target block (x, y, z, Vx, Vy, Vz, pitch, roll, yaw) positions. The training operation involved the agent learning to provide xyzg (g being gripper) input action commands to the robot based on the vector of concatenated positions provided to the agent.
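A minimal sketch of this observation/action layout is shown below; the array names and placeholder values are illustrative only and not taken from the original codebase.

```python
import numpy as np

# Each body contributes a 9-element state: x, y, z, Vx, Vy, Vz, pitch, roll, yaw.
gripper_state = np.zeros(9)
target_object_state = np.zeros(9)
placement_block_state = np.zeros(9)   # used for pick-and-place only

observation = np.concatenate([gripper_state, target_object_state, placement_block_state])

# The agent maps this vector to an end effector command: x, y, z and gripper width g.
action = np.array([0.01, 0.00, -0.02, 0.04])
```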
Figure 2. Various Panda environment configurations. (a) Closed gripper with end-effector control [x, y, z]. Target stationary and penetrable. (b) Open gripper (gripper width W) with end-effector control [x, y, z, W]. Target object is dynamic and impenetrable. (c) Open gripper (gripper width W), end-effector control [x, y, z, W]. Target object is dynamic and impenetrable. Placement block on right.
Task difficulty progressively increases, with the first task, reach, being the simplest, and the third task, pick-and-place, being the most difficult. As shown in Figure 2, the first task, reach, only requires control of the end effector (EE) XYZ position. The reach target object is stationary and penetrable, so the gripper cannot cause changes in the target object position with collisions. The second task, grasp, requires the agent to control the XYZ of the EE, as well as the gripper width (G). The task complexity increases because the target object is not stationary and is impenetrable, so any collisions between the EE and the target block will cause the block to move and/or slide off the table. For this task, the agent must learn to approach the block by following specific paths. The third task, pick-and-place, requires the agent to actuate the gripper similarly to grasp. The task complexity for pick-and-place is significantly higher than for grasp, as the agent must learn the grasp, lift, and transportation action sequences. Additionally, for the pick-and-place task, the agent must learn to avoid collisions with the large placement block.

2.1.3. RL Algorithms

The two algorithms tested and compared for this research were soft actor–critic (SAC) and proximal policy optimization (PPO). SAC was selected due to its sample efficiency for complex problems, and PPO was selected due to its hyperparameter insensitivity and stable convergence characteristics. Table 1 compares some of the key characteristics of these two methods.

Table 1. Comparison of SAC and PPO.

                              PPO                      SAC
Policy type                   On-policy                Off-policy
Optimization method           Policy optimization      Q-learning and policy optimization
Update stability              High                     Low
Hyperparameter sensitivity    Low                      High
Sample efficiency             Low                      High

Proximal Policy Optimization


PPO is a policy gradient technique which was designed to provide faster policy
updates than previously developed RL algorithms such as the advantage actor–critic (A2C)
or deterministic policy gradient (DPG). PPO applies the DPG structure, but updates the
policy parameter θ based on a simple surrogate objective function [32].

PPO is designed as an improvement to trust region policy optimization (TRPO). TRPO


optimizes the return of the policy in the infinite-horizon MDP by implementing the loss function shown below [32],

Maximize_θ E_s[(π_θ(a|s)/π_θold(a|s)) Â]   (1)

where the policy parameter θ is maximized based on the product of the ratio of new and
old policies and the advantage function Â. TRPO constrains the updates to the policy
parameter θ through the introduction of the KL divergence constraint. The trust region
constraint can be expressed as shown in the following equation [32,33],

E_t[KL(π_θold(·|s), π_θ(·|s))] ≤ δ   (2)

where δ is the size of the constraint region. This constraint limits the difference between the
new policy and the old policy to prevent large, unstable updates.
The final loss function for TRPO can be expressed as shown [33],

Maximize_θ E[(π_θ(a|s)/π_θold(a|s)) Â − β KL(π_θold(·|s), π_θ(·|s))]   (3)

where β is a fixed penalty coefficient. TRPO is overly complicated to solve, as it requires a


conjugating gradient method. PPO has an advantage over the TRPO technique because
it is simpler to solve, due to the reduction of the region constraint to a penalty in the
loss function.
PPO introduces a clipped surrogate objective function, which penalizes any changes
that move the ratio of the new and old policies away from 1. The objective function is [33]

L^clip(θ) = Ê[min(r(θ) Â, clip(r(θ), 1 − ε, 1 + ε) Â)]   (4)

where ε is the clipping hyperparameter (0.1–0.2). By using this objective function, the policy ratio r(θ) is clipped to the interval [1 − ε, 1 + ε] [34].
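A hedged PyTorch sketch of the clipped surrogate objective in Equation (4) is given below; the log-probability and advantage tensors are assumed to be precomputed by the rollout code, and the loss is negated because optimizers minimize.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    """Clipped surrogate objective of Equation (4), negated for gradient descent."""
    ratio = torch.exp(log_prob_new - log_prob_old)                      # r(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```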
PPO is a stable training technique as it constantly learns the policy in an on-policy way
through continuous exploration without the use of a replay buffer. The main disadvantage
of PPO is the low sample efficiency, and the convergence to a single deterministic policy.
An alternative to PPO is the SAC technique.

Soft Actor–Critic
SAC is an off-policy actor–critic method, founded on advantage actor–critic (A2C).
SAC was selected over A2C and DPG approaches due to its effectiveness balancing
the exploration–exploitation tradeoff, and its ease of parallelization. SAC balances the
exploration–exploitation tradeoff with entropy regularization, which encourages the agent
to explore based on the “temperature” (uncertainty) at a given time step. The formulation
for the entropy is
H(x) = E[−log(π(·|x))],   (5)
where π is the probability density function for the policy, and x is a random variable
representing the state.
SAC works by learning the policy and two value functions simultaneously. The policy
formulation is

π* = argmax_π Σ_{t=0}^∞ γ^t [R(s, a, s′) + α H(s)],   (6)

where H(·) is the entropy of the policy at a given timestep, π* is the optimal policy, γ is the
discount rate (time dependent), and α is the entropy regularization coefficient [35,36]. Here,

the entropy serves as a reward for the agent at each time step to encourage or discourage exploration. The formulations of the value functions V* and Q* are

Q^π(s, a) = r(s, a) + E_s′[V^π(s′)]   (7)

V^π(s) = E_{a∼π}[Q^π(s, a) − α H(s)]   (8)

where α is a dual variable (fixed or varying) and r is the reward given a state–action pairing [35,37]. In the case where alpha is varying, α is formulated as

α ← α + λ E[log(π*(a|s)) + H(s)]   (9)

where λ is the learning rate.
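For concreteness, the sketch below mirrors this temperature update in the way it is commonly implemented (a learnable log-alpha trained against a target entropy); the target entropy value and the batch of log-probabilities are assumptions, and λ corresponds to the optimizer learning rate.

```python
import torch

log_alpha = torch.zeros(1, requires_grad=True)           # alpha is optimized in log space
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -4.0                                     # e.g., minus the action dimension (x, y, z, g)

def update_alpha(log_prob_batch):
    """One gradient step on the temperature, mirroring the update in Equation (9)."""
    alpha_loss = -(log_alpha * (log_prob_batch + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```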

2.1.4. The RL Training Codebase


The primary RL codebase implemented for this project was Stable Baselines 3 (SB3) [38].
SB3 (1.4.0) was selected as the primary codebase for testing PPO and SAC, because it can
be easily modified to accommodate the custom Panda environment, and has prebuilt
parallelization, visualization, and GPU integration features. Additionally, SB3 has excellent
supporting documentation, many functioning examples, and is built on PyTorch.
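A minimal SB3 training sketch for the two algorithms compared in this work is shown below. The environment ID "PandaGrasp-v0", the policy class, and the hyperparameter values are placeholders rather than the tuned settings reported later; "MultiInputPolicy" assumes goal-based dictionary observations, and a flat observation vector would use "MlpPolicy" instead.

```python
from stable_baselines3 import PPO, SAC
from stable_baselines3.common.env_util import make_vec_env

# Parallel copies of the task environment ("PandaGrasp-v0" is a placeholder ID).
vec_env = make_vec_env("PandaGrasp-v0", n_envs=4)

ppo_model = PPO("MultiInputPolicy", vec_env, batch_size=256, gamma=0.95, verbose=1)
ppo_model.learn(total_timesteps=4_000_000)
ppo_model.save("ppo_panda_grasp")

sac_model = SAC("MultiInputPolicy", vec_env, batch_size=256, ent_coef="auto", verbose=1)
sac_model.learn(total_timesteps=100_000)
sac_model.save("sac_panda_grasp")
```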
Table 2 compares the primary open-source codebases implemented by the RL commu-
nity. The table indicates that SB3 has a slight advantage over RLlib due to the additional
documentation, and consistent PyTorch backbone. Tianshou was also considered but
rejected due to the lack of documentation, a small user base and limited tutorials.

Table 2. Comparison of RL codebases.

                                                 SB3                     RLlib                   Tianshou
Backbone                                         PyTorch                 PyTorch/TF              PyTorch
Documentation                                    Excellent (15 pages)    Excellent (11 pages)    Good (6 pages)
Number of codebase tutorials and worked examples 12                      24                      7
Last commit                                      <1 week                 <1 week                 <2 weeks
Pretrained models                                Yes                     Yes                     No

2.1.5. Optuna
Due to the complex nature of RL and robotic control, hyperparameter tuning is crucial.
RL has all the same hyperparameters as required in supervised learning such as number of
epochs, batch size, choice of activation function, learning rate, and optimization algorithm.
Additionally, RL problems have a range of hyperparameters not required in supervised
learning, such as number of steps (time between updates), gamma (discount factor), and
entropy coefficient (confidence parameter, encourages exploration). During training, it
was noticed that network parameters played a significant role in the speed and stability of
convergence, so the number of hidden layers and the neural network width were treated as
hyperparameters to be optimized.
For RL problems, grid or random search would be unreasonable because >1 million
hyperparameter combinations would be required to achieve solutions close to optimal.
The intelligent hyperparameter search algorithm applied for this project was the Gaus-
sian process-based tree-structured Parzen estimator (TSPE) [39]. For the Panda robot,
the TSPE objective function contained the “reward” metric, which was maximized dur-
ing training [40]. To improve learning speed, the median-tuner pruning technique was
applied. The pruner was set to start pruning after completing 1/3 of the steps for each
hyperparameter trial.
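The sketch below illustrates this tuning setup with Optuna's TPE sampler and median pruner. The search ranges and the train_and_evaluate() helper are assumptions for illustration: the helper would train the agent with the sampled hyperparameters, report intermediate rewards for pruning, and return the final mean episode reward.

```python
import optuna

def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "gamma": trial.suggest_float("gamma", 0.90, 0.999),
        "batch_size": 2 ** trial.suggest_int("batch_size_exp", 7, 11),
        "n_layers": trial.suggest_int("n_layers", 2, 7),
        "layer_width": trial.suggest_int("layer_width", 64, 512),
    }
    # train_and_evaluate() is a hypothetical helper: it trains the agent, calls
    # trial.report() with intermediate rewards, and returns the final mean reward.
    return train_and_evaluate(params, trial)

study = optuna.create_study(
    direction="maximize",                                   # the "reward" metric is maximized
    sampler=optuna.samplers.TPESampler(),
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),  # no pruning for early reports
)
study.optimize(objective, n_trials=100)
print(study.best_params)

# optuna.visualization.plot_parallel_coordinate(study) and
# optuna.visualization.plot_param_importances(study) produce PCP- and HIP-style figures.
```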
After each Optuna trial was completed, the results of the trial were viewed in the parallel coordinate plot (PCP) and the hyperparameter importance plot (HIP). PCPs were implemented as a tool for comparing hyperparameters, to learn how specific hyperparameter ranges affected training accuracy. HIPs were implemented to determine which hyperparameters caused the most significant impact on training results. Figure 3 depicts an example of PCP and HIP plots from an Optuna trial. Note that the column values have different number formats for different hyperparameters, i.e., for some columns the range is an integer (e.g., number of epochs), some are floating points (e.g., gamma, learning rate), and some are integers representing powers (e.g., batch size, ranging from 128 (2^7) to 512 (2^9)).

Figure 3. An example of a parallel coordinate plot (PCP) and hyperparameter importance plot (HIP) from the training of PPO for vector-based Panda grasp with dense rewards.

The PCP is a useful tool when comparing the performance of specific ranges of hyperparameters, and the relationships between them. From the PCP shown in Figure 3, it is clear that GAE lambda in the range of 0.8–0.9, with gamma values in the range of 0.9–0.95 and learning rates in the range of 0.001–0.005, performs best for PPO for vector-based Panda grasp with dense rewards.

For some trials, the relationship between the objective value and specific hyperparameters shown in the PCP was not clear. For such cases, HIPs were implemented to determine if the hyperparameters of concern had a significant impact on training. Hyperparameters with little impact on training were fixed once a realistic value for the hyperparameter was found from the literature or from hyperparameter tuning. The Results section presents figures of PCPs. Through the use of HIPs, values which had little effect on training performance were identified and removed during early rounds of hyperparameter tuning.

2.1.6. Reward Structure


Reward-shaping plays a critical role in solving RL problems. The two reward struc-
tures implemented for this project were dense (heterogeneous) and sparse (homogeneous)
rewards, shown in Figure 4.
With the standard sparse reward scheme, the agent received a reward of −1 for all
states except the final placement state. The agent had difficulty solving complex RL prob-
lems with this reward scheme due to the Monte Carlo (random) nature of this approach.
The dense reward function implemented for this project was the heterogeneous reinforce-
ment function [41]. In this approach, the reward improves as the agent executes the task correctly. For example, for reach, the reward improves as the distance between the EE and the target sphere decreases.
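The two schemes can be summarized by the following sketch; the success threshold and the purely distance-based dense term are assumptions chosen to match the description above, not the exact functions used in the project.

```python
import numpy as np

def sparse_reward(ee_position, target_position, threshold=0.05):
    # -1 on every step until the end effector is within the success threshold
    return 0.0 if np.linalg.norm(ee_position - target_position) < threshold else -1.0

def dense_reward(ee_position, target_position):
    # reward grows toward zero as the end effector approaches the target
    return -float(np.linalg.norm(ee_position - target_position))
```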
Figure 4. Sparse (left) vs dense (right) reward shaping.



For this project, dense and sparse rewards were compared for reach and grasp. The pick-and-place task required the agent to pick up, transport, and release the target block at the correct location. For this task, a standard sparse reward function was too difficult to solve, so only the results for the dense reward scheme are shown.

Hierarchical RL was considered for this project; however, to constrain project scope, only standard sparse and dense rewards were compared. Alternative research [21,42] investigates the use of hierarchical RL for similar applications.

2.2. Experiment Design and Robotic Control

The robot implemented for this project is the Panda Research Robot developed by Franka Emika. The Panda robot was purchased as a packaged system which contained the arm, the gripper, the control box, the communication controller, the joint lock, and a kill switch, all of which can be viewed in Figure 5. Additional widgets added include the Azure Kinect (for remote viewing and control) and the workstation display monitor. The Panda system was selected for the project, as it is a research tool which allows for quick implementation and testing of various control strategies [43].

Figure 5. Panda robot lab setup.

2.2.1. Panda Robot


The Panda robot consists of the arm: a kinematic chain with seven articulated joints,
and a gripper end effector. A kill switch is included for operator safety, and the joint lock
protects the joints from being back driven while the robot is not in use. Each joint imple-
ments high-precision encoders and torque sensors which enable the robot to have a pose
repeatability of 0.1 mm and a force resolution of 0.05 N. With a maximum gripping force
of 140 N, the robot can support a payload of 3 kg [43], making it an ideal choice for tasks such as pick-and-place.

The Panda was procured alongside the control unit for the device, which is a part of the Franka control interface (FCI). The FCI is the interface for controlling the motion of the robotic arm from a local workstation via an ethernet connection. The FCI interface allows for bidirectional communication between the agent and the workstation for positional readings (joint measurements, desired joint goals, external torques, collision information) and commands (desired torque, joint position or velocity, cartesian pose or velocity, and task commands). The communication framework for the FCI can be viewed in Figure 6.

Figure 6. The FCI communication framework.

The FCI allows for 1 kHz signals to be communicated between the workstation PC
and the Panda robot. To enable the use of this high-frequency signal communication, the
Ubuntu workstation requires a real-time Linux Kernel (5.14.2-rt21).

2.2.2. Control Interfaces


The control interface implemented for the Panda is Franka-ROS(noetic). Franka-ROS
is a package which communicates with the FCI via libfranka, a prebuilt C++ network
communication package.
Libfranka(0.8.0) is a package used for basic non-real-time and real-time functions such
as setting controller parameters, processing feedback signals, reading robot states, gener-
ating basic motion paths, and sending torque commands. Franka-ROS was implemented
to wrap libfranka to allow for integration of the FCI into the ROS environment (Figure 6).
Once FCI was integrated inside the ROS environment, other ROS packages such as MoveIt
were applied.
MoveIt(0.1.0) is an open-source library designed for implementing advanced motion
planning, kinematics, control, and navigation strategies [44]. The MoveIt control interface
is more advanced than most other planners, as it incorporates a collision-detection pipeline
prior to allowing action execution.
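The sketch below illustrates how such a MoveIt-planned, collision-checked motion is commanded from Python via moveit_commander; the node name and the "panda_arm" planning group are assumptions for a typical franka_ros + panda_moveit_config setup.

```python
import sys
import rospy
import moveit_commander

moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("panda_rl_control", anonymous=True)

arm = moveit_commander.MoveGroupCommander("panda_arm")   # group name from panda_moveit_config
arm.set_max_velocity_scaling_factor(0.2)                 # slow motions for safe testing

# Joint-space goal; MoveIt plans and runs its collision-detection pipeline before executing.
joint_goal = arm.get_current_joint_values()
joint_goal[0] += 0.1
arm.go(joint_goal, wait=True)
arm.stop()
```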

2.2.3. Control Implementation

The framework for controlling the Panda robot is presented in Figure 7. The explanation for the control cycle is as follows. The Panda sends the joint state information over a ROS node to PyBullet. PyBullet receives the joint information and instantiates the simulated Panda in the associated position. After the PyBullet environment is created, the Panda state information is fed into the RL agent, and the agent outputs an action command (XYZG). The joint states required to follow the particular action command are calculated inside of PyBullet with the use of the inverse kinematics package (Samuel Buss Inverse Kinematics Library) [45]. After the joint trajectories are calculated, the action is executed in the PyBullet environment, and the joint positions are published to the MoveIt framework. MoveIt accepts the joint positions and executes the action after checking the safety of the action with the collision-detection pipeline. Once the action is completed in the real world, and the position of the end effector is within the accuracy threshold, the process repeats. The agent iteratively steps through this control cycle until the task is completed in the simulated world.

Figure 7. Framework for control communication between the PyBullet simulator, the real-world robot, the ROS system, and the RL agent.
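A hedged sketch of this cycle is given below. The policy, panda_sim, move_group, and task_complete objects are placeholders for the trained agent, the PyBullet digital twin, the MoveIt commander, and a task-success check; their method names are illustrative and not taken from the original codebase.

```python
import numpy as np

def control_cycle(policy, panda_sim, move_group, task_complete, tolerance=0.01):
    """One-to-one stepping between the PyBullet digital twin and the real arm."""
    while not task_complete():
        observation = panda_sim.get_observation()                    # simulated state vector
        action, _ = policy.predict(observation, deterministic=True)  # [x, y, z, g] command
        joint_targets = panda_sim.solve_ik(action)                   # IK solved inside PyBullet
        panda_sim.step(joint_targets)                                # advance the digital twin

        move_group.go(joint_targets, wait=True)                      # collision-checked execution
        move_group.stop()

        real_joints = np.array(move_group.get_current_joint_values())
        if np.linalg.norm(real_joints - np.array(joint_targets)) > tolerance:
            break   # real arm did not reach the simulated pose within the accuracy threshold
```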
3. Results and Discussion

Both simulation and real-world testing were completed for this project. Section 3.1 depicts the results from hyperparameter tuning and simulation training. Section 3.2 shows the accuracy of real-world testing after simulation training.
3.1. Simulation Results
The section below presents the simulation analysis completed for training the robotic
agent. Parallel coordinate plots were implemented as a tool for hyperparameter tuning for
each environment. After tuning, convergence plots were implemented to depict the success
of each tuned model.
Task complexity had a major impact on network design and hyperparameter values.
A range of hyperparameter combinations could be found quickly for the simple reach and
grasp tasks. For the pick-and-place task, extensive tuning (100+ trials) was required.

3.1.1. Panda Reach and Panda Grasp with Dense Rewards


This section reviews the results for the Panda reach and grasp tasks with dense reward
functions. The grasp problem was more complex than the reach problem because the EE
had to learn to grasp the target while not bumping it off the table. The reward structure
for this problem is dense. Every time the agent takes a step in a direction, the reward
is increased or decreased based on the relative distance between the end effector and
the target.
Several sets of hyperparameter studies were completed for these tasks. The reach task was relatively simple to solve and required minimal hyperparameter tuning. The grasp task had higher complexity and required several studies. The grasp hyperparameters performed well when applied to the reach task, so the PCP for the grasp task is presented in Figure 8.

Figure 8. Left: PCP for PPO Panda grasp with dense rewards. Agent receives the highest objective value with a batch size of 2^8 and three environments. Right: PCP for SAC Panda grasp with dense rewards. Agent receives the highest objective value with a batch size of 2^8 or 2^10 and a learning rate of ~0.0075.
As shown in Figure 9, Panda reach was trained efficiently for both PPO and SAC. The PPO model took 2.9 million time steps to converge to a reach accuracy greater than −0.75. The SAC model converged very quickly, breaking the average reward of −0.75 after only 0.04 million steps. With this reward range, both the PPO and SAC agents had a 100% success rate on the reach task.

Figure 9. Left: PPO convergence for Panda reach with dense rewards. Right: SAC convergence for Panda reach with dense rewards.

As shown in Figure 10, Panda grasp was trained efficiently for both PPO and SAC. The PPO model took 4.0 million time steps to converge to an average accuracy greater than −0.5. The SAC model converged very quickly, breaking the average reward of −0.5 after only 0.08 million steps. With these rewards, PPO was able to successfully complete the task for 89% of the attempts, compared to the SAC agent, which was able to complete the task for 92% of the attempts.

Figure 10. Left: PPO convergence for Panda grasp with dense rewards. Right: SAC convergence for Panda grasp with dense rewards.

3.1.2. Panda Reach and Panda Grasp with Sparse Rewards

This section reviews the results for the Panda reach and grasp tasks with sparse reward functions. Like Section 3.1.1, this problem required the Panda reach agent to learn the relationship between the reward and the EE XYZ coordinates and the target XYZ coordinates. The difference between this section and Section 3.1.1 is the reward the agent receives during exploration. For the sparse reward scheme, the agent receives a reward of −1 after each step, unless the agent reaches the target position and correctly completes the task.

Several sets of hyperparameter studies were completed for these tasks (Figure 11). Compared to the same tasks with dense rewards, the networks for these tasks had to be deeper and wider. The grasp hyperparameters performed well when applied to the reach task, so only one set of hyperparameter tuning was required.

Figure 11. Left: PCP for PPO Panda grasp with sparse rewards. Agent receives the highest objective value with a learning rate of ~0.001, two environments, and a network depth of five layers. Right: PCP for SAC Panda grasp with sparse rewards. Agent receives the highest objective value with an entropy coefficient of ~0.002, a batch size of 2^10, a learning rate of ~0.0015, and a gamma of ~0.96.

As shown in Figure 12, Panda reach was trained efficiently for both PPO and SAC. The PPO model took 5.6 million time steps to converge to an accuracy greater than −2.75. The SAC model converged quickly, breaking the average reward of −1.9 after only 0.16 million steps. With this reward range, the PPO and SAC agents had 100% success rates.

Figure 12. Left: PPO convergence for Panda reach with sparse rewards. Right: SAC convergence for Panda reach with sparse rewards.

As can be seen in Figure 13, Panda grasp was trained efficiently for both PPO and SAC. The PPO model took 5.5 million time steps to converge to an average accuracy greater than −2.75. The SAC model converged quickly, breaking the average reward of −3.1 after only 0.14 million steps. With this reward range, the PPO and SAC agents were able to complete the task with 90% and 95% success rates, respectively.

Figure 13. Left: PPO convergence for Panda grasp with sparse rewards. Right: SAC convergence for Panda grasp with sparse rewards.

The contrast between sparse and dense rewards can be understood by comparing this section with Section 3.1.1. With sparse reward functions, the agent required approximately twice as many training steps, because additional exploration was required to find the states resulting in positive rewards.

The agent consistently received a high negative reward with the sparse scheme (range of −1.5 to −3), because the reward is −1 per step rather than being based on position. For the reach task, both sparse and dense reward schemes resulted in the agent completing the task with a 100% success rate. For the grasp task, agents trained with sparse rewards outperformed agents trained with dense rewards by an average of 2%.

These results indicate that for problems with minimal difficulty, a simple, sparse, deterministic reward signal is most effective. The main disadvantage of the sparse reward scheme is that significant exploration and clock-time are required for the policy to obtain convergence.

3.1.3. Panda Pick-and-Place with Dense Rewards
This section reviews the results for the Panda pick-and-place task with dense rewards.
For the Panda pick-and-place problem, the agent had to learn the relationship between
the reward and the end effector XYZ coordinates, the target XYZ coordinates, and the
placement location XYZ position. Of all the tasks tested, pick-and-place had the highest
complexity. A dense reward function was implemented, because sparse rewards make the
Robotics 2023, 12, x FOR PEER REVIEWtask intractable. 15 of 20
Several sets of hyperparameter studies were completed for this task. The results from
hyperparameter tuning can be seen in the PCP’s shown in Figure 14. Due to the problem
complexity, the training algorithm required high-entropy coefficients and a high action
noise to increase the exploration of the solution space. Deeper networks with 5-7 layers
were required to solve this problem optimally.

Figure 14. Left: Parallel coordinate plot (PCP) for PPO Panda pick-and-place with dense rewards.
Figure
Agent14. Left: the
receives Parallel coordinate
highest plot (PCP)
objective value for PPO
with batch sizesPanda pick-and-place
of 210 or with
211 , learning rate dense rewards.
of ~0.00075,
Agent receives the highest objective value with batch sizes of 2 10 or 211, learning rate of ~0.00075, and
and a network depth of seven layers. The batch size is shown. Right: Parallel coordinate plot (PCP)
a network depthpick-and-place
for SAC panda of seven layers.
withThe batch
dense size isAgent
rewards. shown. Right:
receives theParallel coordinate
highest objective plot
value (PCP) for
with
SAC panda
batch pick-and-place
sizes of 2 , 10 learningwith
10 starts,dense rewards.
and five Agent receives the highest objective value with
environments.
batch sizes of 210, 10 learning starts, and five environments.
As can be seen in Figure 15, Panda pick-and-place was trained efficiently with both
PPO and SAC. The PPO model took 8.3 million-time steps to converge to a reach reward
greater than −7.0 and the SAC model took 0.47 million-time steps to converge to a reach
reward greater than −7.0. With this reward range, the PPO and SAC agents could complete
the task with 85% and 71% success rates, respectively.
The use of sparse rewards means that there are many suboptimal solutions for this
problem. One suboptimal solution involves pushing the target object toward the target
position while keeping the block on the table. This solution increases the reward, because
the distance between the target object and the target position is decreased; however, this
solution does not involve grasping and lifting the block to complete the task. Both the SAC
and PPO solutions have a rolling reward convergence, due to the agent first learning these
suboptimal solutions.

Figure 15. Left: PPO convergence for Panda pick-and-place with dense rewards. Right: SAC con-
vergence for Panda pick-and-place with dense rewards.

3.1.4. Summary
Given sufficient hyperparameter tuning and reward shaping, both the PPO and SAC agents were able to learn effective control policies for reach, grasp, and pick-and-place. For the relatively simple reach and grasp tasks, both sparse and dense rewards performed well. The main consequence of using sparse rewards is the extensive hyperparameter tuning required and the increase in training time. Due to the complexity of the pick-and-place task, the dense reward scheme was implemented. After extensive training, the agent was able to complete the task with an average accuracy of 78%. Table 3 summarizes all the simulation results. The results for pick-and-place were noticeably lower than some of the results reported in the literature. The primary reason for this difference is the obstacle avoidance that must be learned in this environment. The pick-and-place task simulated here not only requires the agent to move the target block to a position in space, but also requires it to avoid the placement block while completing this motion. Most of the failures observed in simulation were due to interference between the gripper and the large target block. Although this requirement makes the task significantly more difficult, the results are more realistic for application in the real world.
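To make the tuning pipeline behind the parallel coordinate plots concrete, the following is a minimal sketch of a hyperparameter search wired up with Optuna and Stable-Baselines3. The environment id, search ranges, and per-trial training budget are assumptions for illustration, not the exact configuration used in this study.

import gymnasium as gym
import panda_gym
import optuna
from stable_baselines3 import SAC
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial: optuna.Trial) -> float:
    # Search space loosely mirroring the parallel coordinate plots; ranges are assumed.
    batch_size = trial.suggest_categorical("batch_size", [256, 512, 1024, 2048])
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    learning_starts = trial.suggest_categorical("learning_starts", [10, 100, 1000])

    env = gym.make("PandaPickAndPlaceDense-v3")  # assumed environment id
    model = SAC("MultiInputPolicy", env, batch_size=batch_size,
                learning_rate=learning_rate, learning_starts=learning_starts, verbose=0)
    model.learn(total_timesteps=50_000)  # short budget per trial for illustration
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=20)
    return mean_reward

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
optuna.visualization.plot_parallel_coordinate(study).show()  # plots similar to Figure 14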

Table 3. RL performance on simulated tasks.

Simulated Problem        Reward    Positional Feedback Method    RL Implementation    Task Success Rate (%)
Panda reach              Dense     Vector                        PPO                  100
Panda reach              Dense     Vector                        SAC                  100
Panda reach              Sparse    Vector                        PPO                  100
Panda reach              Sparse    Vector                        SAC                  100
Panda grasp              Dense     Vector                        PPO                  89
Panda grasp              Dense     Vector                        SAC                  92
Panda grasp              Sparse    Vector                        PPO                  90
Panda grasp              Sparse    Vector                        SAC                  95
Panda pick-and-place     Dense     Vector                        PPO                  85
Panda pick-and-place     Dense     Vector                        SAC                  71

The comparison of PPO and SAC reveals several patterns in training time, performance, and convergence. For all problems, SAC required at least one order of magnitude fewer steps than PPO to reach convergence. The SAC advantage in training time stands in stark contrast to SAC's convergence difficulties. SAC was highly hyperparameter-sensitive compared to PPO, which resulted in additional time being spent to determine the ideal hyperparameters for each task. For the simple reach and grasp problems, SAC converged to a more optimal solution than PPO; however, for the more difficult pick-and-place problem, PPO significantly outperformed SAC.
The results from the comparison of SAC and PPO are intuitive. SAC implements
entropy maximization and off-policy training to reduce training time. The consequence
of these training principles is that SAC is sample-efficient but tends to converge to a local
optimum. PPO maintained slow, consistent convergence, whereas SAC tended to diverge when overtrained.
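As an illustration of this distinction, the two algorithms are instantiated below with Stable-Baselines3, highlighting the settings that drive the behaviour described above; the environment id and the specific values are assumptions, not the tuned settings from this work.

import gymnasium as gym
import panda_gym
from stable_baselines3 import PPO, SAC

env = gym.make("PandaPickAndPlaceDense-v3")  # assumed environment id

# PPO is on-policy: each update consumes freshly collected rollouts of n_steps transitions,
# which makes convergence slow but steady.
ppo = PPO("MultiInputPolicy", env, n_steps=2048, batch_size=256)

# SAC is off-policy with maximum-entropy exploration: updates resample old transitions from
# a replay buffer and the entropy coefficient is tuned automatically, giving high sample
# efficiency at the cost of greater sensitivity to hyperparameters.
sac = SAC("MultiInputPolicy", env, buffer_size=1_000_000, ent_coef="auto")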

3.2. Real-World RL
After all simulated RL tasks were developed, tuned, and trained, the networks were tested in the real world. For each implementation, 10 tests were completed to approximate the real-world testing accuracy. Due to the stochastic nature of the PPO and SAC policies, the planned path and final grasp position were different for each attempted grasp. For each grasp position, two grasp attempts were made. This testing serves to validate the accuracy of the PyBullet digital twin and the potential of this methodology for real-world RL implementation. Some of the sample real and simulated test grasps and pick-and-place actions can be viewed in the Supplementary Materials.
During testing, the Panda agent was instantiated in PyBullet, and each incremental
step the agent took in the simulation space was replicated in the real world. After each
action step was executed in PyBullet, the agent waited to take a new step until the real-
world arm moved to match its digital twin. During each step, the Franka-ROS feedback
loop ensured the positional accuracy of the end effector. The incremental stepping approach was implemented to slow down the real-world testing, preventing damage to the robot in the event of a collision with the mounting table or any objects in the robot's vicinity.
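A simplified version of this step-and-match loop is sketched below. The real_arm wrapper, its move_to and ee_position methods, the get_ee_position accessor, and the tolerance value are all hypothetical placeholders for the MoveIt/Franka-ROS interface described above.

import time
import numpy as np

POSITION_TOL = 0.005  # metres; assumed tolerance for the end-effector position check

def run_episode(policy, sim_env, real_arm, max_steps=200):
    """Replicate each simulated action on the real robot before stepping again."""
    obs, _ = sim_env.reset()
    for _ in range(max_steps):
        action, _ = policy.predict(obs, deterministic=False)  # stochastic PPO/SAC policy
        obs, reward, terminated, truncated, info = sim_env.step(action)

        # Mirror the digital twin: command the real end effector to the simulated pose and
        # wait until the feedback loop confirms that the pose has been reached.
        target = np.asarray(sim_env.unwrapped.get_ee_position())  # assumed accessor
        real_arm.move_to(target)                                  # hypothetical MoveIt call
        while np.linalg.norm(real_arm.ee_position() - target) > POSITION_TOL:
            time.sleep(0.05)

        if terminated or truncated:
            break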
Table 4 depicts the real-world performance of the RL agent for each task. As shown, the reach and grasp tasks were completed with relatively high accuracy. Two minor issues that reduced the task completion rate during testing were (1) small differences between the geometry of the tested object and that of the simulated object, and (2) imperfect calibration of the physics environment.
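The second issue amounts to matching the simulated object and dynamics parameters to their measured real-world counterparts. A minimal PyBullet sketch of this kind of adjustment is given below; the numeric values are placeholders rather than the parameters used in this work.

import pybullet as p
import pybullet_data

p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

# Placeholder values: in practice the block's measured dimensions, mass, and friction
# would be copied into the simulation so that the digital twin matches the real object.
block_id = p.loadURDF("cube_small.urdf", basePosition=[0.5, 0.0, 0.02], globalScaling=1.1)
p.changeDynamics(block_id, -1, lateralFriction=0.6, mass=0.05)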

Table 4. Real-world task performance.

Real Problem             Reward    Positional Feedback Method    RL Implementation    Task Success Rate (%)
Panda reach              Dense     Vector                        PPO                  90
Panda reach              Dense     Vector                        SAC                  90
Panda grasp              Dense     Vector                        PPO                  70
Panda grasp              Dense     Vector                        SAC                  80
Panda pick-and-place     Dense     Vector                        PPO                  70
Panda pick-and-place     Dense     Vector                        SAC                  60

Future work that could improve real-world testing includes the following. First, the simulated or real-world target could be modified so that both target objects match. This change is relatively small but would prevent geometry differences from causing failures. Second, the simulation environment could be upgraded to the recently open-sourced simulation package MuJoCo. Given the higher accuracy of this simulator and its low computational cost, this change would improve sim-to-real transfer without increasing training time. Finally, positional sensing could be applied to the real-world target block to ensure that the real and simulated target positions match during each trained task.

4. Conclusions
Considerable progress has been made in this project toward the goal of creating simu-
lated and real-world autonomous robotic agents capable of performing tasks such as reach,
grasp, and pick-and-place. To achieve this goal, custom representative simulation environ-
ments were created, a combined RL and traditional control methodology was developed, a
custom tuning pipeline was implemented, and real-world testing was completed.
Through the extensive tuning of the RL algorithms SAC and PPO, optimal hyperparameter combinations and network designs were found. The results of this tuning were used to compare PPO and SAC for robotic control. The comparison indicates that PPO performs best when the task is complex (involves object avoidance) and training time is readily available, whereas SAC performs best when the task is simple and training time is limited.
After optimal SAC and PPO hyperparameters were found, SAC and PPO algorithms
were connected with the Libfranka, Franka-ROS, and MoveIt control packages to test the
connection between the simulated PyBullet agent and the real-world Panda robot. Real-
world testing was conducted to validate the novel communication framework developed
for the simulated and real environments, and to verify that real-world policy transference
is possible.
During real-world testing, the accuracy of the reach, grasp, and pick-and-place tasks was reduced by 10–20% compared to the simulation environment. This result provides an optimistic indication of the future applicability of this method, and also indicates that further calibration of the simulation environment and modifications to the target object are required.

Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/robotics12010012/s1, Video S1: RL GRASP. Real and simulated grasp and pick-and-place testing.
Author Contributions: Conceptualization, A.L.; methodology, A.L.; writing—original draft prepa-
ration, A.L.; writing—review and editing, H.-J.K.; supervision, H.-J.K.; project administration, H.-
J.K.; funding acquisition, H.-J.K. All authors have read and agreed to the published version of the
manuscript.
Funding: This research was supported by the Korea-Canada Artificial Intelligence Joint Research
Center at the Korea Electrotechnology Research Institute (Operation Project: No. 22A03009), which is
funded by Changwon City, Korea.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author. The data are not publicly available due to other continuing research on
this topic.
Conflicts of Interest: The authors declare no conflict of interest.


Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
