Article
Simulated and Real Robotic Reach, Grasp, and Pick-and-Place
Using Combined Reinforcement Learning and
Traditional Controls
Andrew Lobbezoo * and Hyock-Ju Kwon
Abstract: The majority of robots in factories today are operated with conventional control strategies
that require individual programming on a task-by-task basis, with no margin for error. As an
alternative to the rudimentary operation planning and task-programming techniques, machine
learning has shown significant promise for higher-level task planning, with the development of
reinforcement learning (RL)-based control strategies. This paper reviews the implementation of
combined traditional and RL control for simulated and real environments to validate the RL approach
for standard industrial tasks such as reach, grasp, and pick-and-place. The goal of this research is to
bring intelligence to robotic control so that robotic operations can be completed without precisely
defining the environment, constraints, and the action plan. The results from this approach provide
optimistic preliminary data on the application of RL to real-world robotics.
is room for further exploration and research for tasks with long action sequences (pick-
and-place), there is a need for one-to-one comparisons between RL methods, and there
is an absence of real-world testing to validate the applicability of RL outside of simula-
tion. A detailed review on the current state of RL for robotic research can be found in
Lobbezoo et al. [24], Mohammed and Chua [25], Liu et al. [11], and Tai et al. [6].
1.3. Objective
The principal objective of this research is to explore the application of RL to sim-
ulated and real-world robotic agents to develop a method for replacing high-level task
programming. The objective has been broken down into the following subobjectives: (1) the
development of a pipeline for training robotic agents in simulation, (2) the training of vari-
ous models and the comparison of performance between each, and (3) testing of the RL
control system in the real world.
1.4. Contribution
The novel contributions of this research to the field can be summarized as follows.
1. We developed a novel pipeline for combining traditional control with RL to vali-
date the applicability of RL for end-to-end high-level robotic arm task planning and
trajectory generation.
2. We modified and tuned the hyperparameters and networks of two existing RL frame-
works to enable the completion of several standard robotics tasks without the use of
manual control.
3. We completed validation testing in the real world to confirm the feasibility and
potential of this approach for replacing manual task programming.
Other minor contributions include the following.
1. We created realistic simulation tasks for training and testing the application of RL for
robotic control.
2. We completed direct comparisons between PPO and SAC to review the potential of
each for task learning.
agent must learn to avoid collisions with the large placement block.
where the policy parameter θ is maximized based on the product of the ratio of new and
old policies and the advantage function Â. TRPO constrains the updates to the policy
parameter θ through the introduction of the KL divergence constraint. The trust region
constraint can be expressed as shown in the following equation [32,33],

$$\mathbb{E}\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s),\ \pi_{\theta}(\cdot|s)\right]\right] \le \delta \tag{2}$$

where δ is the size of the constraint region. This constraint limits the difference between the new policy and the old policy to prevent large, unstable updates.
The final loss function for TRPO can be expressed as shown [33],

$$\underset{\theta}{\mathrm{Maximize}}\ \mathbb{E}\left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}\hat{A} - \beta\,\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s),\ \pi_{\theta}(\cdot|s)\right]\right] \tag{3}$$

PPO replaces this KL penalty with a clipped surrogate objective on the probability ratio $r(\theta) = \pi_{\theta}(a|s)/\pi_{\theta_{old}}(a|s)$ [32],

$$L^{CLIP}(\theta) = \mathbb{E}\left[\min\left(r(\theta)\hat{A},\ \mathrm{clip}\left(r(\theta),\ 1-\epsilon,\ 1+\epsilon\right)\hat{A}\right)\right] \tag{4}$$

where ε is the clipping hyperparameter (0.1–0.2). By using this objective function, the probability ratio between the new and old policies is clipped to the interval [1 − ε, 1 + ε] [34].
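As an illustration of Equation (4) (not the implementation used in this work or in any RL library), the following minimal NumPy sketch evaluates the clipped surrogate loss for a small batch; the log-probabilities and advantages are placeholder values standing in for quantities taken from a rollout buffer.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate objective (Equation (4)); minimizing it maximizes L^CLIP."""
    ratio = np.exp(logp_new - logp_old)                      # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

# Placeholder batch of three transitions.
logp_old = np.array([-1.2, -0.7, -2.0])
logp_new = np.array([-1.0, -0.9, -1.5])
advantages = np.array([0.5, -0.3, 1.1])
print(ppo_clip_loss(logp_new, logp_old, advantages))
```

Because the ratio is clipped, the surrogate gives no incentive to move the policy far from the data-collecting policy within a single update, which is what keeps PPO updates stable.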
PPO is a stable training technique because it continually learns the policy in an on-policy manner through continuous exploration, without the use of a replay buffer. The main disadvantages of PPO are its low sample efficiency and its tendency to converge to a single deterministic policy. An alternative to PPO is the SAC technique.
Soft Actor–Critic
SAC is an off-policy actor–critic method, founded on advantage actor–critic (A2C).
SAC was selected over A2C and DPG approaches due to its effectiveness in balancing the exploration–exploitation tradeoff and its ease of parallelization. SAC balances the
exploration–exploitation tradeoff with entropy regularization, which encourages the agent
to explore based on the “temperature” (uncertainty) at a given time step. The formulation
for the entropy is
$$\mathcal{H}(x) = \mathbb{E}\left[-\log\left(\pi(\cdot|x)\right)\right] \tag{5}$$
where π is the probability density function for the policy, and x is a random variable
representing the state.
SAC works by learning the policy and two value functions simultaneously. The policy
formulation is
$$\pi^{*} = \underset{\pi}{\arg\max}\ \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\left(R\left(s_t, a_t, s_{t+1}\right) + \alpha\,\mathcal{H}\left(\pi(\cdot|s_t)\right)\right)\right] \tag{6}$$
where H(·) is the entropy of the policy at a given timestep, π* is the optimal policy, γ is the discount rate (time dependent), and α is the entropy regularization coefficient [35,36]. Here, the entropy serves as a reward for the agent at each time step to encourage or discourage
exploration. The formulation of the value functions V* and Q* are
$$Q^{\pi^{*}}(s,a) = r(s,a) + \mathbb{E}_{s'}\left[V^{\pi^{*}}(s')\right] \tag{7}$$

$$V^{\pi^{*}}(s) = \mathbb{E}_{a\sim\pi}\left[Q^{\pi^{*}}(s,a) - \alpha\,\mathcal{H}(s)\right] \tag{8}$$
where α is a dual variable (fixed and varying) and r is the reward given a state–action pairing [35,37]. In the case where alpha is varying, α is adjusted automatically during training by optimizing a temperature objective [36].
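For concreteness, the sketch below computes a one-step soft Bellman target in the convention of Haarnoja et al. [35], which corresponds to Equations (5), (7) and (8); the reward, critic estimate, log-probability, and temperature values are placeholders rather than outputs of a trained agent.

```python
def soft_q_target(reward, gamma, q_next, logp_next, alpha, done=False):
    """One-step soft Bellman target:
    y = r + gamma * (Q(s', a') - alpha * log pi(a'|s')),
    where the expectation over a' ~ pi is approximated with a single sampled action."""
    soft_value_next = q_next - alpha * logp_next          # entropy-regularized value of s'
    return reward + gamma * (1.0 - float(done)) * soft_value_next

# Placeholder transition: a -1 step reward and a sampled next action with log pi = -1.4.
print(soft_q_target(reward=-1.0, gamma=0.95, q_next=0.8, logp_next=-1.4, alpha=0.2))
```

A larger α puts more weight on the −log π term, rewarding high-entropy (exploratory) behavior; a smaller α makes the target approach the standard Bellman backup.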
2.1.5. Optuna
Due to the complex nature of RL and robotic control, hyperparameter tuning is crucial.
RL has all the same hyperparameters as required in supervised learning such as number of
epochs, batch size, choice of activation function, learning rate, and optimization algorithm.
Additionally, RL problems have a range of hyperparameters not required in supervised
learning, such as number of steps (time between updates), gamma (discount factor), and
entropy coefficient (confidence parameter, encourages exploration). During training, it
was noticed that network parameters played a significant role in the speed and stability of
convergence, so the number of hidden layers and the neural network width were treated as
hyperparameters to be optimized.
For RL problems, grid or random search would be unreasonable because >1 million
hyperparameter combinations would be required to achieve solutions close to optimal.
The intelligent hyperparameter search algorithm applied for this project was the Gaussian process-based tree-structured Parzen estimator (TSPE) [39]. For the Panda robot, the TSPE objective function contained the “reward” metric, which was maximized during training [40]. To improve learning speed, the median pruning technique was applied. The pruner was set to start pruning after completing 1/3 of the steps for each hyperparameter trial.
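The sketch below shows the general shape of such a study in Optuna; the search ranges, the trial budget, and the `evaluate_candidate` stand-in for an actual PPO/SAC training run are illustrative assumptions rather than the exact configuration used in this project.

```python
import optuna

def evaluate_candidate(params, step):
    """Stand-in for training an RL agent and returning its mean reward at a checkpoint.
    A real objective would train PPO/SAC with these hyperparameters and evaluate the policy."""
    return -abs(params["gamma"] - 0.95) - abs(params["learning_rate"] - 0.003) + 0.001 * step

def objective(trial):
    params = {
        "gamma": trial.suggest_float("gamma", 0.90, 0.999),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [128, 256, 512]),
        "n_layers": trial.suggest_int("n_layers", 2, 7),
        "layer_width": trial.suggest_categorical("layer_width", [64, 128, 256]),
    }
    n_checkpoints = 30
    reward = 0.0
    for step in range(n_checkpoints):
        reward = evaluate_candidate(params, step)   # periodic evaluation during training
        trial.report(reward, step)                  # let the pruner see intermediate results
        if trial.should_prune():                    # median pruning of unpromising trials
            raise optuna.TrialPruned()
    return reward                                    # maximized by the study

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(),
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),  # roughly the first 1/3 of checkpoints
)
study.optimize(objective, n_trials=50)
print(study.best_params)
```

With `n_warmup_steps` set to roughly one third of the reported checkpoints, the median pruner only begins discarding trials after each trial has had a chance to demonstrate early learning.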
After each Optuna trial was completed, the results of the trial were viewed in the parallel coordinate plot (PCP) and the hyperparameter importance plot (HIP). PCPs were implemented as a tool for comparing hyperparameters, to learn how specific hyperparameter ranges affected training accuracy. HIPs were implemented to determine which hyperparameters caused the most significant impact on training results. Figure 3 depicts an example of PCP and HIP plots from an Optuna trial. Note that the column values have different number formats for different hyperparameters, i.e., for some columns, the range is an integer (i.e., number of epochs), some are floating points (i.e., gamma, learning rate, etc.), and some are integers representing powers (i.e., batch size, ranging from 128 (2^7) to 512 (2^9)).

Figure 3. An example of parallel coordinate plot (PCP) and hyperparameter importance plot (HIP) from the training of PPO for vector-based Panda grasp with dense rewards.

The PCP is a useful tool when comparing the performance of specific ranges of hyperparameters, and the relationships between them. From the PCP shown in Figure 3, it is clear that GAE lambda in the range of 0.8–0.9 with gamma values in the range of 0.9–0.95 and learning rates in the range of 0.001–0.005 perform best for PPO for vector-based Panda grasp with dense rewards.

For some trials, the relationship between the objective value and specific hyperparameters shown in the PCP was not clear. For such cases, HIPs were implemented to determine if the hyperparameters of concern had a significant impact on training. Hyperparameters with little impact on training were fixed once a realistic value for the hyperparameter was found from the literature or from hyperparameter tuning. The Results section presents figures of PCPs. Through the use of HIPs, values which had little effect on training performance were identified and removed during early rounds of hyperparameter tuning.
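For reference, both plot types can be generated directly from a completed study with Optuna's plotly-based visualization module; the sketch below assumes the `study` object from the tuning example above.

```python
from optuna.visualization import plot_parallel_coordinate, plot_param_importances

# Parallel coordinate plot (PCP): relates hyperparameter ranges to the objective value.
plot_parallel_coordinate(study).show()

# Hyperparameter importance plot (HIP): ranks hyperparameters by their influence on the objective.
plot_param_importances(study).show()
```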
With the standard sparse reward scheme, the agent received a reward of −1 for all states except the final placement state. The agent had difficulty solving complex RL problems with this reward scheme due to the Monte Carlo (random) nature of this approach. The dense reward function implemented for this project was the heterogeneous reinforcement function [41]. In this approach, the reward improves as the agent executes the task correctly. For example, for reach, the reward improves as the distance between the EE and the target sphere decreases.

For this project, dense and sparse rewards were compared for reach and grasp. The pick-and-place task required the agent to pick up, transport, and release the target block at the correct location. For this task, a standard sparse reward function was too difficult to solve, so only the results for the dense reward scheme are shown. Hierarchical RL was considered for this project; however, to constrain project scope, only standard sparse and dense rewards were compared. Alternative research [21,42] investigates the use of hierarchical RL for similar applications.
use of hierarchical RL for similar applications.
For this project, dense and sparse rewards were compared for reach and grasp. The
2.2. Experiment
pick-and-place taskDesign
requiredandtheRobotic
agentControl
to pick up, transport, and release the target block
at the correct location.
The robot For this task,
implemented for athis
standard
projectsparse
is the reward function Robot
Panda Research was too difficult by
developed
to solve,
Frankaso Emika.
only theThe results
Panda forrobot
the dense reward scheme
was purchased are shown.
as a packaged system which contained the
Hierarchical
arm, RL was
the gripper, considered
the control box,for this project; however,
communication controller,to joint
constrain
lock,project scope,
and a kill switch,
only standard
which
Robotics 2023, 12, x FOR PEER REVIEW all sparse
can be and
vieweddense
in rewards
Figure 5. were compared.
Additional widgetsAlternative
added research
include the [21,42]
Azure in-
Kinect
9 of 20
(for remote
vestigates the use viewing and control)
of hierarchical andsimilar
RL for the workstation
applications.display monitor. The Panda system
was selected for the project, as it is a research tool which allows for quick implementation
2.2. Experiment Design and Robotic Control
The robot implemented for this project is the Panda Research Robot developed by
Franka Emika. The Panda robot was purchased as a packaged system which contained
the arm, the gripper, the control box, communication controller, joint lock, and a kill
switch, which all can be viewed in Figure 5. Additional widgets added include the Azure
Kinect (for remote viewing and control) and the workstation display monitor. The Panda
system was selected for the project, as it is a research tool which allows for quick imple-
mentation and testing for various control strategies [43].
Figure 5. Panda robot lab setup.
Figure 6. The FCI communication framework.
The FCI allows for 1 kHz signals to be communicated between the workstation PC
and the Panda robot. To enable the use of this high-frequency signal communication, the
Ubuntu workstation requires a real-time Linux Kernel (5.14.2-rt21).
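As a quick sanity check before launching high-frequency control (a minimal sketch, not part of the published pipeline), the active kernel can be confirmed from Python:

```python
import platform

release = platform.release()
print(release)  # expected to contain an "-rt" suffix on a PREEMPT_RT kernel, e.g., "5.14.2-rt21"
if "rt" not in release:
    raise SystemExit("A real-time (PREEMPT_RT) kernel is required for 1 kHz FCI communication.")
```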
Figure 7. Framework for control communication between the PyBullet simulator, the real-world robot, the ROS system and the RL agent.

3. Results and Discussion
Both simulation and real-world testing were completed for this project. Section 3.1
depicts the results from hyperparameter tuning and simulation training. Section 3.2 shows
the accuracy of real-world testing after simulation training.
Figure 8. Left: PCP for PPO Panda grasp with dense rewards. Agent receives the highest objective value with a batch size of 2^8 and three environments. Right: PCP for SAC Panda grasp with dense rewards. Agent receives the highest objective value with a batch size of 2^8 or 2^10 and a learning rate of ~0.0075.
As shown in Figure 9, Panda reach was trained efficiently for both PPO and SAC. The PPO model took 2.9 million-time steps to converge to a reach accuracy greater than −0.75. The SAC model converged very quickly, breaking the average reward of −0.75 after only 0.04 million steps. With this reward range, both the PPO and SAC agents had a 100% success rate on the reach task.
Figure 9. Left: PPO convergence for Panda reach with dense rewards. Right: SAC convergence for Panda reach with dense rewards.
As shown in Figure 10, Panda grasp was trained efficiently for both PPO and SAC. The PPO model took 4.0 million-time steps to converge to an average accuracy greater than −0.5. The SAC model converged very quickly, breaking the average reward of −0.5 after only 0.08 million steps. With these rewards, PPO was able to successfully complete the task for 89% of the attempts, compared to the SAC agent which was able to complete the task for 92% of the attempts.
Figure 10. Left: PPO convergence for Panda grasp with dense rewards. Right: SAC convergence for Panda grasp with dense rewards.
3.1.2. Panda Reach and Panda Grasp with Sparse Rewards

This section reviews the results for the Panda reach and grasp tasks with sparse reward functions. Like Section 3.1.1, this problem required the Panda reach agent to learn the relationship between the reward and the EE XYZ coordinates, and the target XYZ coordinates. The difference between this section and Section 3.1.1 is the reward the agent receives during exploration. For the sparse reward scheme, the agent receives a reward of −1 after each step, unless the agent reaches the target position and correctly completes the task.

Several sets of hyperparameter studies were completed for these tasks (Figure 11). Compared to the same tasks with dense rewards, the networks for these tasks had to be deeper and wider. The grasp hyperparameters performed well when applied to the reach task, so only one set of hyperparameter tuning was required.

Figure 11. Left: PCP for PPO Panda grasp with sparse rewards. Agent receives the highest objective value with a learning rate of ~0.001, two environments, and a network depth of five layers. Right: PCP for SAC Panda grasp with sparse rewards. Agent receives the highest objective value with an entropy coefficient of ~0.002, a batch size of 2^10, a learning rate of ~0.0015, and a gamma of ~0.96.
As shown in Figure 12, Panda reach was trained efficiently for both PPO and SAC. The
PPO model took 5.6 million-time steps to converge to an accuracy greater than −2.75. The
SAC model converged quickly, breaking the average reward of −1.9 after only 0.16 million
steps. With this reward range, PPO and SAC agents had 100% success rates.
Figure 12. Left: PPO convergence for Panda reach with sparse rewards. Right: SAC convergence for Panda reach with sparse rewards.
As can be seen in Figure 13, Panda grasp was trained efficiently for both PPO and SAC.
The PPO model took 5.5 million-time steps to converge to an average accuracy greater than
−2.75. The SAC model converged quickly, breaking the average reward of −3.1 after only
0.14 million steps. With this reward range, the PPO and SAC agents were able to complete
the task with 90% and 95% success rates, respectively.
Figure 13. Left: PPO convergence for Panda grasp with sparse rewards. Right: SAC convergence for Panda grasp with sparse rewards.
The contrast between sparse and dense rewards can be understood by comparing this section with Section 3.1.1. With sparse reward functions, the agent required approximately twice as many training steps, because additional exploration was required to find the states resulting in positive rewards.

Compared with the dense scheme, the agent consistently received a larger negative reward (in the range of −1.5 to −3), because the reward is −1 per step rather than being based on position. For the reach task, both sparse and dense reward schemes resulted in the agent completing the task with a 100% success rate. For the grasp task, agents trained with sparse rewards outperformed agents trained with dense rewards by an average of 2%.

These results indicate that for problems with minimal difficulty, a simple, sparse, deterministic reward signal is most effective. The main disadvantage of the sparse reward scheme is that significant exploration and clock-time are required for the policy to obtain convergence.
Figure 14. Left: Parallel coordinate plot (PCP) for PPO Panda pick-and-place with dense rewards. Agent receives the highest objective value with batch sizes of 2^10 or 2^11, a learning rate of ~0.00075, and a network depth of seven layers. The batch size is shown. Right: Parallel coordinate plot (PCP) for SAC Panda pick-and-place with dense rewards. Agent receives the highest objective value with batch sizes of 2^10, 10 learning starts, and five environments.
As can be seen in Figure 15, Panda pick-and-place was trained efficiently with both
PPO and SAC. The PPO model took 8.3 million-time steps to converge to a reach reward
greater than −7.0 and the SAC model took 0.47 million-time steps to converge to a reach
reward greater than −7.0. With this reward range, the PPO and SAC agents could complete
the task with 85% and 71% success rates, respectively.
The use of dense rewards means that there are many suboptimal solutions for this
problem. One suboptimal solution involves pushing the target object toward the target
position while keeping the block on the table. This solution increases the reward, because
the distance between the target object and the target position is decreased; however, this
solution does not involve grasping and lifting the block to complete the task. Both the SAC
and PPO solutions have a rolling reward convergence, due to the agent first learning these
suboptimal solutions.
Figure 15. Left: PPO convergence for Panda pick-and-place with dense rewards. Right: SAC convergence for Panda pick-and-place with dense rewards.
3.1.4. Summary

Given sufficient hyperparameter tuning and reward shaping, both PPO and SAC agents were able to effectively learn optimal control policies for reach, grasp, and pick-and-place. For relatively simple reach and grasp tasks, sparse and dense rewards performed well. The main consequence of using sparse rewards is the extensive hyperparameter tuning required and the increase in training time. Due to the task complexity for pick-and-place, the dense reward scheme was implemented. After extensive training, the agent was able to complete the task with an average accuracy of 78%. Table 3 shows a summary of all the simulation results. The results for pick-and-place were noticeably lower than some of the results from the literature. The primary rationale for this difference is the obstacle avoidance which must be learned in this environment. The pick-and-place task simulated here does not only require the agent to move the target block to a position in space, but also requires the agent to avoid the placement block while completing this motion. Most of the failures noticed in simulation were due to interference between the gripper and the large target block. Although this change makes the task significantly more difficult, the results are more realistic for application in the real world.
Table 3. Summary of the simulation training results.

| Simulated Problem | Reward | Positional Feedback Method | RL Implementation | Task Success Rate (%) |
|---|---|---|---|---|
| Panda reach | Dense | Vector | PPO | 100 |
| Panda reach | Dense | Vector | SAC | 100 |
| Panda reach | Sparse | Vector | PPO | 100 |
| Panda reach | Sparse | Vector | SAC | 100 |
| Panda grasp | Dense | Vector | PPO | 89 |
| Panda grasp | Dense | Vector | SAC | 92 |
| Panda grasp | Sparse | Vector | PPO | 90 |
| Panda grasp | Sparse | Vector | SAC | 95 |
| Panda pick-and-place | Dense | Vector | PPO | 85 |
| Panda pick-and-place | Dense | Vector | SAC | 71 |
The comparison of PPO and SAC reveals several patterns in training time, perfor-
mance, and convergence. For all problems, SAC required a minimum of one order of mag-
nitude fewer steps than PPO to obtain convergence. The SAC advantage in training time
stands in stark contrast to SAC's convergence difficulties. SAC was highly hyperparameter-sensitive compared to PPO, which resulted in additional time being spent to determine the ideal hyperparameters for each task. For the simple reach and grasp problems, SAC con-
verged to a more optimal solution than PPO; however, for the more difficult pick-and-place
problem, PPO significantly outperformed SAC.
The results from the comparison of SAC and PPO are intuitive. SAC implements
entropy maximization and off-policy training to reduce training time. The consequence
of these training principles is that SAC is sample-efficient but tends to converge to a local
optimum. PPO maintained slow, consistent convergence and SAC tended to diverge when
overtrained.
3.2. Real-World RL
After all simulated RL tasks were developed, tuned, and trained, the networks were
tested in the real world. For each implementation, 10 tests were completed to approximate
the real-world testing accuracy. Due to the stochastic nature of the PPO and SAC policies,
the planned path and final grasp position were different for each attempted grasp. For
each grasp position, two grasp attempts were made. This testing serves to validate the
accuracy of the PyBullet digital twin, and the potential for this methodology for real-world
RL implementation. Some of the sample real and simulated test grasps and pick-and-place
actions can be viewed in the Supplementary Material.
During testing, the Panda agent was instantiated in PyBullet, and each incremental
step the agent took in the simulation space was replicated in the real world. After each
action step was executed in PyBullet, the agent waited to take a new step until the real-
world arm moved to match its digital twin. During each step, the Franka-ROS feedback
loop ensured the positional accuracy of the end effector. The incremental stepping approach
was implemented to slow down the real-world testing to prevent damage to the robot
during collisions with the mounting table or any objects in the robot’s vicinity.
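A minimal sketch of this incremental stepping loop is shown below. The `RealPandaBridge` class and its methods are hypothetical stand-ins for the Libfranka/Franka-ROS/MoveIt interface, the gymnasium-style `env.step` signature and the `info["joint_positions"]` field are assumptions about the simulated environment, and `model.predict` follows the Stable-Baselines3 API.

```python
import time

class RealPandaBridge:
    """Hypothetical stand-in for the Franka-ROS/MoveIt command-and-feedback interface."""
    def move_to_joint_positions(self, q):
        ...  # send the next incremental joint target to the real arm
    def reached(self, q, tol=1e-2):
        return True  # placeholder: report when the real arm matches the target within tol

def run_episode(env, model, bridge, max_steps=200):
    obs, _ = env.reset()
    for _ in range(max_steps):
        # Stochastic policy, as in the real-world tests described above.
        action, _ = model.predict(obs, deterministic=False)
        obs, reward, terminated, truncated, info = env.step(action)  # advance the digital twin
        q_sim = info["joint_positions"]          # assumed to be exposed by the simulated env
        bridge.move_to_joint_positions(q_sim)    # replicate the incremental step on the robot
        while not bridge.reached(q_sim):         # block until the real arm matches its twin
            time.sleep(0.01)
        if terminated or truncated:
            break
```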
Table 4 depicts the real-world performance of the RL agent for each task. As shown,
the reach and grasp tasks were completed with relatively high accuracy. Two minor issues
that caused a reduction in task completion rate during testing were (1) minor differences in the geometry of the tested object vs. the simulated object, and (2) the calibration of the physics environment.
Table 4. Real-world performance of the RL agent for each task.

| Real Problem | Reward | Positional Feedback Method | RL Implementation | Task Success Rate (%) |
|---|---|---|---|---|
| Panda reach | Dense | Vector | PPO | 90 |
| Panda reach | Dense | Vector | SAC | 90 |
| Panda grasp | Dense | Vector | PPO | 70 |
| Panda grasp | Dense | Vector | SAC | 80 |
| Panda pick-and-place | Dense | Vector | PPO | 70 |
| Panda pick-and-place | Dense | Vector | SAC | 60 |
Future work that could improve real-world testing includes the following. First, the
simulated or real-world target could be modified so both target objects match. This change
is relatively small but will prevent geometry differences from causing failure. Second, the
simulation environment can be upgraded to the recently open-sourced simulation package
MuJoCo. Due to the higher accuracy of this environment, and the low computational cost,
this change would improve sim-to-real transfer while not increasing training time. Finally,
positional sensing could be applied to the real-world target block to ensure that the real
and simulated target positions match during each trained task.
4. Conclusions
Considerable progress has been made in this project toward the goal of creating simu-
lated and real-world autonomous robotic agents capable of performing tasks such as reach,
grasp, and pick-and-place. To achieve this goal, custom representative simulation environ-
ments were created, a combined RL and traditional control methodology was developed, a
custom tuning pipeline was implemented, and real-world testing was completed.
Through the extensive tuning of the RL algorithms, SAC and PPO, optimal hyperpa-
rameter combinations and network designs were found. The results of this training were
implemented to complete a comparison between PPO and SAC for robotic control. The
comparison indicates that PPO performs best when the task is complex (involves object
avoidance) and time is readily available. SAC performs best when the task is simple and
time is limited.
After optimal SAC and PPO hyperparameters were found, SAC and PPO algorithms
were connected with the Libfranka, Franka-ROS, and MoveIt control packages to test the
connection between the simulated PyBullet agent and the real-world Panda robot. Real-
world testing was conducted to validate the novel communication framework developed
for the simulated and real environments, and to verify that real-world policy transference
is possible.
During real-world testing, the accuracy of the reach, grasp, and pick-and-place tasks was reduced by 10–20% compared to the simulation environment. This result provides an optimistic indication of the future applicability of this method, and also indicates that
further calibration of the simulation environment and modifications to the target object
are required.
References
1. Massa, D.; Callegari, M.; Cristalli, C. Manual Guidance for Industrial Robot Programming; Emerald Group Publishing Limited:
Bingley, UK, 2015; pp. 457–465. [CrossRef]
2. Biggs, G.; Macdonald, B. A Survey of Robot Programming Systems; Society of Robots: Brisbane, Australia, 2003; p. 27.
3. Saha, S.K. Introduction to Robotics, 2nd ed.; McGraw Hill Education: New Delhi, India, 2014.
4. Craig, J. Introduction to Robotics Mechanics and Control; Pearson Education International: Upper Saddle River, NJ, USA, 2005.
5. Al-Selwi, H.F.; Aziz, A.A.; Abas, F.S.; Zyada, Z. Reinforcement Learning for Robotic Applications with Vision Feedback. In
Proceedings of the 2021 IEEE 17th International Colloquium on Signal Processing & Its Applications (CSPA), Langkawi, Malaysia,
5–6 March 2021.
6. Tai, L.; Zhang, J.; Liu, M.; Boedecker, J.; Burgard, W. A Survey of Deep Network Solutions for Learning Control in Robotics: From
Reinforcement to Imitation. arXiv 2016, arXiv:1612.07139.
7. Kober, J.; Bagnell, A.; Peters, J. Reinforcement Learning in Robotics: A Survey. Int. J. Robot. Res. 2013, 32, 1238–1274. [CrossRef]
8. Liu, D.; Wang, Z.; Lu, B.; Cong, M.; Yu, H.; Zou, Q. A Reinforcement Learning-Based Framework for Robot Manipulation Skill
Acquisition. IEEE Access 2020, 8, 108429–108437. [CrossRef]
9. Kalashnikov, D.; Irpan, A.; Pastor, P.; Ibarz, J.; Herzog, A.; Jang, E.; Quillen, D.; Holly, E.; Kalakrishnan, M.; Vanhoucke, V.; et al.
Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. In Proceedings of the 2nd Conference on Robot
Learning, Zürich, Switzerland, 29 October 2018.
10. Mohammed, M.Q.; Chung, K.L.; Chyi, C.S. Pick and Place Objects in a Cluttered Scene Using Deep Reinforcement Learning. Int.
J. Mech. Mechatron. Eng. 2020, 20, 50–57.
11. Liu, R.; Nageotte, F.; Zanne, P.; de Mathelin, M.; Drespp-Langley, B. Deep Reinforcement Learning for the Control of Robotic
Manipulation: A Focussed Mini-Review. arXiv 2021, arXiv:2102.04148.
12. Kleeberger, K.; Bormann, R.; Kraus, W.; Huber, M. A Survey on Learning-Based Robotic Grasping. Curr. Robot. Rep. 2020, 1,
239–249. [CrossRef]
13. Xiao, Y.; Katt, S.; ten Pas, A.; Chen, S.; Amato, C. Online Planning for Target Object Search in Clutter under Partial Observability.
In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019.
14. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA, USA; London, UK, 2018.
15. Russell, S.; Norvig, P. Artificial Intelligence A Modern Approach, 4th ed.; Pearson Education, Inc.: Hoboken, NJ, USA;
ISBN 978-0-13-461099-3.
16. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep Reinforcement Learning: A Brief Survey. IEEE Signal
Process. Magazine 2017, 34, 26–38. [CrossRef]
17. Ng, A.; Harada, D.; Russell, S. Policy invariance under reward transformations theory and application to reward shaping. In
Proceedings of the Sixteenth International Conference on Machine Learning, San Francisco, CA, USA, 27 June 1999; pp. 278–287.
18. Gualtieri, M.; Pas, A.; Platt, R. Pick and Place Without Geometric Object Models; IEEE: Brisbane, QLD, Australia, 2018; pp. 7433–7440.
19. Gualtieri, M.; Platt, R. Learning 6-DoF Grasping and Pick-Place Using Attention Focus. arXiv 2018, arXiv:1806.06134.
20. Pore, A.; Aragon-Camarasa, G. On Simple Reactive Neural Networks for Behaviour-Based Reinforcement Learning. In Proceed-
ings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020.
21. Li, B.; Lu, T.; Li, J.; Lu, N.; Cai, Y.; Wang, S. ACDER: Augmented Curiosity-Driven Experience Replay. In Proceedings of the 2020
IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 21 August 2020; pp. 4218–4224.
22. Marzari, L.; Pore, A.; Dall’Alba, D.; Aragon-Camarasa, G.; Farinelli, A.; Fiorini, P. Towards Hierarchical Task Decomposition
Using Deep Reinforcement Learning for Pick and Place Subtasks. arXiv 2021, arXiv:2102.04022.
23. Pedersen, M.; Nalpantidis, L.; Andersen, R.; Schou, C.; Bøgh, S.; Krüger, V.; Madsen, O. Robot skills for manufacturing: From
concept to industrial deployment. Robot. Comput.-Integr. Manuf. 2016, 37, 282–291. [CrossRef]
24. Lobbezoo, A.; Qian, Y.; Kwon, H.-J. Reinforcement Learning for Pick and Place Operations in Robotics: A Survey. Robotics 2021,
10, 105. [CrossRef]
25. Mohammed, M.; Kwek, L.; Chua, S. Review of Deep Reinforcement Learning-Based Object Grasping: Techniques, Open
Challenges, and Recommendations. IEEE Access 2020, 8, 178450–178481. [CrossRef]
26. Howard, A. Gazebo. Available online: http://gazebosim.org/ (accessed on 20 September 2022).
27. Erez, T.; Tassa, Y.; Todorov, E. Simulation Tools for Model-Based Robotics: Comparison of Bullet, Havok, MuJoCo, ODE and
PhysX. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30
May 2015; pp. 4397–4404.
28. DeepMind Opening Up a Physics Simulator for Robotics. Available online: https://www.deepmind.com/blog/opening-up-a-
physics-simulator-for-robotics (accessed on 11 July 2022).
29. Coumans, E. Tiny Differentiable Simulator. Available online: https://pybullet.org/wordpress/ (accessed on 10 June 2022).
30. Gallouédec, Q.; Cazin, N.; Dellandréa, E.; Chen, L. Multi-Goal Reinforcement Learning Environments for Simulated Franka Emika
Panda Robot. arXiv 2021, arXiv:2106.13687.
31. Shahid, A.A.; Piga, D.; Braghin, F.; Roveda, L. Continuous Control Actions Learning and Adaptation for Robotic Manipulation
through Reinforcement Learning. Autonomous Robots 2022, 46, 483–498. [CrossRef]
32. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017,
arXiv:1707.06347.
33. Karagiannakos, S. Trust Region and Proximal Policy Optimization (TRPO and PPO). Available online: https://theaisummer.
com/TRPO_PPO/ (accessed on 13 December 2021).
34. Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.I.; Abbeel, P. Trust Region Policy Optimization. arXiv 2015, arXiv:1502.05477.
35. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a
Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
36. Tuomas, H.; Zhou, A.; Hartikainen, K.; Tucker, G. Soft Actor-Critic Algorithms and Applications. arXiv 2019, arXiv:1812.05905v2.
37. Haarnoja, T.; Ha, S.; Zhou, A.; Tan, J.; Tucker, G.; Levine, S. Learning To Walk via Deep Reinforcement Learning. arXiv 2019,
arXiv:1812.11103.
38. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning
Implementations. J. Mach. Learn. Res. 2021, 22, 1–8.
39. Bergstra, J.; Bardenet, R.; Bengio, Y.; Kegl, B. Algorithms for Hyper-Parameter Optimization; Curran Associates Inc.: Granada, Spain,
2011; pp. 2546–2554.
40. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In
Proceedings of the Applied Data Science Track Paper, Anchorage, AK, USA, 4 August 2019.
41. Mataric, M.J. Reward functions for accelerated learning. In Machine Learning Proceedings 1994; Elsevier: Amsterdam, The
Netherlands, 1994; pp. 181–189.
42. Anca, M.; Studley, M. Twin Delayed Hierarchical Actor-Critic. In Proceedings of the 2021 7th International Conference on
Automation, Robotics and Applications (ICARA), Prague, Czech Republic, 4–6 February 2021.
43. Franka Emika. Data Sheet Robot—Arm & Control. Available online: https://pkj-robotics.dk/wp-content/uploads/2020/09/
Franka-Emika_Brochure_EN_April20_PKJ.pdf (accessed on 13 July 2021).
44. Görner, M.; Haschk, R.; Ritter, H.; Zhang, J. MoveIt! Task Constructor for Task-Level Motion Planning. In Proceedings of the 2019
International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019.
45. Coumans, E.; Bai, Y. PyBullet Quickstart Guide. Available online: https://docs.google.com/document/d/10sXEhzFRSnvFcl3
XxNGhnD4N2SedqwdAvK3dsihxVUA/edit#heading=h.2ye70wns7io3 (accessed on 12 March 2022).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.