Review
A Survey on Deep Reinforcement Learning Algorithms for
Robotic Manipulation
Dong Han 1 , Beni Mulyana 1 , Vladimir Stankovic 2 and Samuel Cheng 1, *
1 School of Electrical and Computer Engineering, University of Oklahoma, Norman, OK 73019, USA
2 Department of Electronic and Electrical Engineering, University of Strathclyde, Glasgow G1 1XW, UK
* Correspondence: [email protected]; Tel.: +1-918-401-0409
Abstract: Robotic manipulation challenges, such as grasping and object manipulation, have been
tackled successfully with the help of deep reinforcement learning systems. We give an overview
of the recent advances in deep reinforcement learning algorithms for robotic manipulation tasks in
this review. We begin by outlining the fundamental ideas of reinforcement learning and the parts
of a reinforcement learning system. The many deep reinforcement learning algorithms, such as
value-based methods, policy-based methods, and actor–critic approaches, that have been suggested
for robotic manipulation tasks are then covered. We also examine the numerous issues that have
arisen when applying these algorithms to robotics tasks, as well as the various solutions that have
been put forth to deal with these issues. Finally, we highlight several unsolved research issues and
talk about possible future directions for the subject.
1. Introduction
Industry 4.0's embrace of artificial intelligence (AI) has drastically changed the industrial sector by allowing machines to operate autonomously, increasing productivity, lowering costs, and improving product quality [1]. In order to make decisions quickly and efficiently during the manufacturing process, AI technology analyzes enormous amounts of data produced by machines and processes. In the next five years, its total market value is expected to triple [2]. Robotics is one of the markets that is anticipated to expand at the fastest rates. With the concept of robot manipulation proposed in 1962 [3], the idea of a robot is to mimic human behavior and tackle complex tasks. A branch of robotics called robotic manipulation is focused on developing robots that can manipulate items in their environment. Robotic manipulation seeks to develop robotic systems that can carry out a variety of activities that call for the manipulation of things, such as putting together goods in a factory, picking and placing items in a warehouse, or performing surgery in a hospital. Robotic manipulation systems typically consist of a robot arm, a gripper or end-effector, and a control system that coordinates the movement of the robot arm and gripper. The gripper is in charge of grasping and manipulating items, while the robot arm is in charge of transferring them to the target area in the environment. Figure 1 shows a classic robotic manipulation workflow. Moreover, Matt Mason provided a thorough and in-depth description of manipulation in the introduction of his 2018 review paper [4]. Robotic manipulation is used in various fields such as manufacturing, agriculture, healthcare, logistics, space exploration, education, and research. It entails using robots to carry out activities including assembling, planting, harvesting, operating, managing goods, and performing experiments. In a variety of tasks, robotic manipulation can boost productivity, cut human costs, increase accuracy, and enhance safety. In the upcoming years, its application is anticipated to increase across a range of industries. In the manufacturing sector, robots are used for tasks including component assembly, welding,
painting, and packaging. They have the ability to work carefully and diligently, increasing
productivity and decreasing costs. In the healthcare sector, robots may assist with tasks
including surgery, rehabilitation, and geriatric care. They can support healthcare personnel
by letting them focus on more challenging tasks while the robot handles the basic tasks.
In space exploration, robots are used to complete activities including sample gathering,
structure construction, and equipment repair. They can operate in environments that are
too dangerous or challenging for people, such as deep space or the ocean floor. However,
the interaction between robotic manipulator arms and objects designed for humans re-
mains a challenge. To date, no robot can easily achieve intelligent operations such as
handwashing dishes, peeling a pineapple, or rearranging furniture.
For example, deep learning has been utilized to teach agents to recognize objects [6]. As RL algorithms are capable of learning
from experience and adapting to shifting settings, they provide a promising solution to a
wide range of difficult issues in several industries.
In this survey, we will examine the key concepts and algorithms that have been
developed for DRL in the context of robotic manipulation. This will include a review of
techniques for reward engineering, such as imitation learning and curriculum learning,
as well as approaches to hierarchical reinforcement learning. We will also discuss the
various network architectures that have been used in DRL for robotic manipulation and
the challenges associated with transferring learned policies from simulation to the real
world. Finally, we will review both value-based and policy-based DRL algorithms and
their relative strengths and limitations for robotic manipulation tasks. The contributions of
the paper are:
• A tutorial of the current RL algorithms and reward engineering methods used in
robotic manipulation.
• An analysis of the current status and application of RL in robotic manipulation in the
past seven years.
• An overview of the current main trends and directions in the use of RL in robotic
manipulation tasks.
The rest of the paper is organized as follows. The methodology used to find and choose
relevant publications is described in Section 2. In Section 3, we introduce the key RL concepts
and state-of-the-art algorithms. Next, Section 4 continues by describing the learning methods
for DRL. In Section 5, we discuss the current neural network architectures in RL. In Section 6,
we take a deep dive into the applications and implementations of robotic manipulation. Then,
we describe the existing challenges and future directions with respect to previous work.
Finally, Section 7 concludes the survey with a summary of the insights obtained.
2. Search Methodology
Since RL is more adaptable in highly dynamic and unstructured environments than
more traditional or other AI control approaches, there has been a recent increase in interest
in using RL to operate robotic manipulators [14]. These techniques have demonstrated
impressive results in enabling robots to learn complex tasks and operate in dynamic
environments. Moreover, the growing interest in robotic manipulation in reinforcement
learning has been stimulated by the expanding availability of affordable and efficient
robotic hardware, as well as the rising demand for automation across a variety of industries.
Applications of this technology include manufacturing, logistics, healthcare, and home
automation, among others. So, the purpose of this review is to present an overview of the
major works using RL in robotic manipulation tasks and to analyze the future directions of
this topic. In order to achieve this goal, a thorough review of the literature is conducted,
and the content of more than 150 articles in relevant fields is searched and reviewed.
Given the enormous amount of literature on the subject, looking for papers on rein-
forcement learning for robotic manipulation can be a difficult task. Start by identifying the
relevant keywords for the search, such as “reinforcement learning”, “robotic manipulation”,
“manipulation tasks”, “control policies”, “deep learning”, etc. These keywords will help
narrow down the search to the most relevant papers. Use a specialized search engine such
as Google Scholar, IEEE Xplore, or ArXiv to search for papers related to reinforcement
learning for robotic manipulation. These search engines allow for filtering the results from
2015 to 2022. The start of this period was chosen because of the release of the AlphaGo program [15], which combined RL with deep neural networks. This innovation has made a significant contribution to the rapid growth of RL.
Meanwhile, studies that were not appropriate for the scope of this review had to be
excluded even though they were relevant to the field of RL. Studies that are outdated or do
not contribute to the current state of the field should be excluded. Additionally, the authors
decided to exclude papers that are poorly written or that have significant methodological
flaws. An overview of the specified search criteria can be found in Table 1.
Table 1. Overview of the search criteria.
Keywords: "reinforcement learning" AND "robotic manipulation" AND "manipulation tasks" AND "control policies"
Search engine: Google Scholar, IEEE Xplore, or ArXiv
Time period: Between 2015 and the present
Publication type: Academic conference papers and journal articles
Relevance: Exclude studies that are not appropriate for the scope of the review
Outdated: Exclude old papers that are no longer relevant to the current state of the field
Quality: Exclude poorly written or methodologically flawed papers
An autonomous agent observes the state s(t) at a time step t and then interacts with
the environment using an action a(t), reaching the next state s(t + 1) in the process. Once a
new state has been achieved, the agent receives a reward correlated with that state r (t + 1).
The agent’s goal is to find an optimal policy, i.e., the optimal action in any given state.
Unlike other types of machine learning, such as supervised and unsupervised learning, reinforcement learning is inherently sequential: it reasons over state-action pairs that appear one after the other.
RL assesses actions by the outcomes, i.e., the states, they achieve. It is goal-oriented
and seeks to learn sequences of actions that will lead an agent to accomplish its goal or
optimize its objective function. An example of the RL objective function is:
$$\sum_{t=0}^{\infty} \gamma^{t}\, r(s(t), a(t)) \qquad (1)$$

This objective function sums all of the rewards collected while running through the states, with each reward weighted by the exponentially decaying discount factor γ^t (0 ≤ γ < 1).
Two important concepts of RL are Monte Carlo learning, which is a naive idea in
which the agent interacts with the environment and learns about the states and rewards,
and temporal difference (TD) learning, i.e., updating the value at every time step rather
than being required to wait to update the values until the end of the episode.
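As a concrete illustration of this difference, the following Python sketch contrasts a Monte Carlo update, applied once the episode has ended, with a TD(0) update applied at every step; the toy two-state episode generator and all constants are purely illustrative and not part of any surveyed system.

```python
# Toy contrast between Monte Carlo and TD(0) value updates. The two-state
# episode generator and all constants are illustrative only.
import random

GAMMA, ALPHA = 0.9, 0.1
V_mc = {"s0": 0.0, "s1": 0.0, "terminal": 0.0}
V_td = {"s0": 0.0, "s1": 0.0, "terminal": 0.0}

def sample_episode():
    """Return (state, reward) pairs; the reward is received on leaving the state."""
    return [("s0", 0.0), ("s1", random.choice([0.0, 1.0])), ("terminal", 0.0)]

for _ in range(1000):
    episode = sample_episode()

    # Monte Carlo: wait for the episode to finish, then update every visited
    # state toward the full discounted return observed after it.
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + GAMMA * G
        V_mc[state] += ALPHA * (G - V_mc[state])

    # TD(0): update at every step using the bootstrapped value of the next state.
    for (state, reward), (next_state, _) in zip(episode, episode[1:]):
        target = reward + GAMMA * V_td[next_state]
        V_td[state] += ALPHA * (target - V_td[state])

print(V_mc, V_td)
```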
Although it is difficult to make a standardized classification of RL algorithms due
to their wide modularity, many current studies tend to divide them into value-based,
policy-based, and actor–critic algorithms (see Figure 3).
3.2. Value-Based RL
3.2.1. Q-Learning
Q-learning [16] is a value-based TD method of reinforcement learning that uses Q-
values (also called state-action values) to iteratively develop the actions of the learning
agent. Q-learning learns the Bellman action-value function Q(s, a), which estimates how
good it is to take an action at a given state.
The Bellman action-value equation describes how to calculate the Q-value for an action taken from a particular state, s. It is calculated as the sum of the immediate reward for the current state and the optimal Q-value for the next state, discounted by the discount factor γ. In Q-learning, the Q-value is updated using the following rule: the new Q-value is equal to the old Q-value plus the temporal difference error. This update can be framed as trying to minimize a loss function, such as the mean squared error loss:

$$L = \left( r + \gamma \max_{a'} Q(s', a'; \theta') - Q(s, a; \theta) \right)^{2}. \qquad (3)$$
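As an illustration of this update rule, the following sketch implements tabular Q-learning with an epsilon-greedy behavior policy; the action set, hyperparameters, and environment interaction loop are assumptions made only for the example.

```python
# Tabular Q-learning sketch of the update implied by Equation (3); the action
# set, hyperparameters, and environment interaction loop are assumptions.
import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.1
ACTIONS = [0, 1]
Q = defaultdict(float)                      # Q[(state, action)] -> estimated value

def epsilon_greedy(state):
    """Behavior policy used while interacting with the environment."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, done):
    # Off-policy target: immediate reward plus discounted best next Q-value.
    target = r if done else r + GAMMA * max(Q[(s_next, a_next)] for a_next in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])   # move the old value toward the target
```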
3.2.2. SARSA
The SARSA [17] algorithm is an on-policy variant of the well-known Q-learning algorithm. Unlike Q-learning, which is an off-policy technique that learns the Q-value using a greedy approach, SARSA uses the action actually taken by the current policy to learn the Q-value. To update Q-values, we use:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \qquad (4)$$

where the next action a_{t+1} is chosen by the current policy, for example through the epsilon-greedy action selection process. On the other hand, Q-learning is an off-policy algorithm that learns the optimal policy directly. While this can be more efficient in certain cases, it also has a higher per-sample variance and can be more difficult to converge when used with neural networks.
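A minimal sketch of the corresponding SARSA update is given below; in contrast to the Q-learning sketch above, the target uses the action actually selected by the current policy in the next state. The Q table is the same kind of dictionary used in the previous example.

```python
# SARSA update sketch: the target uses the next action a_next actually chosen by
# the current (e.g., epsilon-greedy) policy, making the update on-policy.
def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.95):
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```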
In a deep Q-network (DQN) [18], the Q-function is approximated by a neural network, and training is stabilized with an experience replay buffer and a target network whose parameters slowly track the online network, e.g., through the soft update

$$\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\, \theta^{\mu'} \qquad (6)$$

where θ^{µ'} and θ^{Q'} denote the parameters of the target policy network and the target Q-network, respectively. Figure 4 shows the overall workflow of a deep Q-network. Double DQN [19] further decouples action selection from action evaluation in the target:

$$Q(s, a; \theta) = r + \gamma\, Q\!\left(s', \operatorname*{argmax}_{a'} Q(s', a'; \theta);\, \theta'\right) \qquad (7)$$
Here, the main neural network, θ, determines the best next action, a0 , while the target
network is used to evaluate this action and compute its Q-value. This simple change has
been shown to reduce overestimations and lead to better final policies.
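The following sketch computes this double-DQN target with PyTorch; the network interfaces, batch format, and discount factor are assumptions made for illustration only.

```python
# Double-DQN target from Equation (7) in PyTorch: the online network selects the
# next action and the target network evaluates it. Network interfaces, batch
# format, and the discount factor are assumptions for illustration.
import torch

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Action selection with the online parameters theta ...
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ... action evaluation with the target parameters theta'.
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```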
3.3. Policy-Based RL
Policy gradient (PG) methods are widely used reinforcement learning algorithms that
are particularly well-suited to situations with continuous action spaces [21]. The goal of an
RL agent using a PG method is to maximize the expected reward, J(π_θ) = E_{τ∼π_θ}[R(τ)], by adjusting the policy parameters, θ. A standard approach to finding the optimal policy is to use gradient ascent, in which the policy parameters are updated according to the following rule:

$$\theta_{t+1} = \theta_t + \alpha \nabla J(\pi_{\theta_t}) \qquad (8)$$

where α is the learning rate, and ∇J(π_θ) is the policy gradient. This gradient can be further expanded and reformulated as:

$$\nabla J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]. \qquad (9)$$
In PG methods, the policy function, which maps states to actions, is learned explicitly
and actions are selected without using a value function.
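A common way to realize Equation (9) in practice is to minimize the negative weighted log-likelihood of the taken actions, whose gradient equals the negated policy gradient; the sketch below assumes a discrete-action policy network and illustrative tensor shapes.

```python
# Sketch of a REINFORCE-style loss whose gradient matches Equation (9):
# minimizing -sum_t log pi(a_t|s_t) * R(tau) ascends the policy gradient.
# The policy network, tensor shapes, and discrete action space are assumptions.
import torch

def reinforce_loss(policy_net, states, actions, trajectory_return):
    # states: [T, obs_dim]; actions: [T] (integer actions); trajectory_return: scalar R(tau)
    logits = policy_net(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Negative sign: optimizers minimize, while Equation (9) is a gradient-ascent direction.
    return -(log_probs * trajectory_return).sum()
```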
Many PG variants replace the full return R(τ) with the advantage function, A^π(s, a) = Q^π(s, a) − V^π(s), where Q^π(s, a) is the action-value function, and V^π(s) is the state-value function for the policy π. Using this definition, the vanilla policy gradient (VPG) algorithm [21] can be written as:

$$\nabla J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi}(s, a) \right] \qquad (11)$$
Trust region policy optimization (TRPO) [22] constrains each policy update so that the new policy stays close to the old one, solving

$$\max_{\theta}\; \mathbb{E}_{\tau \sim \pi_{\theta_{old}}}\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t \right] \quad \text{subject to} \quad \mathbb{E}\!\left[ \mathrm{KL}\!\left(\pi_{\theta_{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\right) \right] \le \delta$$

where δ is the size of the trust region, and the KL divergence between the old and new policies must be less than δ. This optimization problem can be solved using the conjugate gradient method. Proximal policy optimization (PPO) [23] replaces the hard constraint with a KL-divergence penalty weighted by a coefficient C:

$$\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t - C \cdot \mathrm{KL}_{\pi_{\theta_{old}}}(\pi_\theta) \right] \qquad (14)$$
One challenge of PPO is choosing the appropriate value for C. To address this, the
algorithm updates C based on the magnitude of the KL divergence. If the KL divergence
is too high, C is increased, and, if it is too low, C is decreased. This allows for effective
optimization over the course of training.
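A minimal sketch of such an adaptive penalty coefficient is shown below; the target KL value, thresholds, and scaling factor are illustrative choices rather than values prescribed by PPO.

```python
# Adaptive KL-penalty coefficient as described above: C grows when the measured
# KL divergence overshoots a target and shrinks when it undershoots. The target
# value, thresholds, and scaling factor are illustrative choices.
def adapt_kl_coefficient(C, measured_kl, kl_target=0.01, factor=1.5):
    if measured_kl > 1.5 * kl_target:
        C *= factor          # policy moved too far: penalize the KL term more
    elif measured_kl < kl_target / 1.5:
        C /= factor          # policy barely moved: relax the penalty
    return C
```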
3.4. Actor–Critic
Actor–critics [24] are RL algorithms that combine elements of both value-based and
policy-based methods. In this approach, the actor, a policy network, proposes an action
for a given state, while the critic, a value network, evaluates the action based on the state-
action pair. The critic uses the Bellman equation to learn the Q-function, and the actor is
updated based on the Q-function to train the policy. This allows the actor–critic approach
to take advantage of the strengths of both value-based and policy-based methods. Figure 6
illustrates the network architecture of the actor–critic.
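The following sketch outlines one step of this interplay for a discrete-action actor and a state-value critic; the networks, optimizers, and tensor shapes are assumptions made for illustration only.

```python
# One-step actor-critic sketch: the critic regresses a TD target and the actor
# is updated with the TD error as an advantage estimate. Networks, optimizers,
# and tensor shapes are assumptions made for illustration only.
import torch
import torch.nn.functional as F

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      s, a, r, s_next, done, gamma=0.99):
    value = critic(s)                                   # V(s)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * critic(s_next)
    advantage = (target - value).detach()               # TD error as advantage

    critic_loss = F.mse_loss(value, target)             # critic: value regression
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * advantage).mean()         # actor: policy-gradient step
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```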
For deterministic-policy actor-critic methods such as the deep deterministic policy gradient (DDPG) [26], the critic is trained by minimizing a mean squared temporal-difference error over a minibatch of N transitions:

$$L = \frac{1}{N} \sum_{i} \left( Q_{old} - \left( r(s, a) + \gamma \max_{a} Q_{target}(s', a) \right) \right)^{2} \qquad (15)$$
Since the policy is deterministic, exploration is inefficient if the agent simply follows it. To improve exploration in DDPG, the authors added Ornstein-Uhlenbeck noise [29] to the selected actions during training. However, more recent research implies that uncorrelated, zero-mean Gaussian noise is just as effective. Figure 7 shows the update rule of the deep deterministic policy gradient.
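The sketch below shows both exploration schemes, adding either temporally correlated Ornstein-Uhlenbeck noise or zero-mean Gaussian noise to the deterministic actor output; all parameter values are illustrative.

```python
# Exploration noise for a deterministic policy: temporally correlated
# Ornstein-Uhlenbeck noise or simple zero-mean Gaussian noise is added to the
# actor's output before execution. All parameter values are illustrative.
import numpy as np

class OUNoise:
    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)

    def sample(self):
        # dx = -theta * x * dt + sigma * sqrt(dt) * N(0, I): mean-reverting noise
        self.x += (-self.theta * self.x * self.dt
                   + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        return self.x

def noisy_action(actor_output, noise=None, gaussian_std=0.1, low=-1.0, high=1.0):
    eps = noise.sample() if noise is not None else gaussian_std * np.random.randn(*actor_output.shape)
    return np.clip(actor_output + eps, low, high)
```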
The loss function implies that across all of the states that we sampled from our
experience replay buffer, we need to decrease the squared difference between the prediction
of our value network and the expected prediction of the Q-function minus the entropy of
the policy function π. We train the soft Q-network by minimizing the following error:
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\!\left[ \frac{1}{2} \left( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^{2} \right] \qquad (18)$$

where

$$\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\!\left[ V_{\bar{\psi}}(s_{t+1}) \right] \qquad (19)$$

The policy, in turn, is updated by attempting to make the distribution of our policy function look more similar to the distribution of the exponentiation of our Q-function.
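A minimal sketch of the soft target in Equations (18) and (19) is given below, using the common bootstrapped form in which the next-state value is expressed through the target Q-network minus the entropy term; the policy.sample interface, tensor shapes, and the temperature α are assumptions for illustration.

```python
# Sketch of the soft Q target in Equations (18)-(19), with the value of the next
# state expressed through the target Q-network and the entropy term
# Q(s', a') - alpha * log pi(a'|s'). The policy.sample interface, shapes, and the
# temperature alpha are assumptions for illustration.
import torch

def soft_q_target(target_q, policy, rewards, next_states, dones,
                  gamma=0.99, alpha=0.2):
    with torch.no_grad():
        # Assumed interface: returns actions a' and their log-probabilities log pi(a'|s').
        next_actions, log_probs = policy.sample(next_states)
        soft_value = target_q(next_states, next_actions) - alpha * log_probs
        return rewards + gamma * (1.0 - dones) * soft_value
```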
After the discussion of RL algorithms, there are two possible approaches for finding
the optimal policy: on-policy and off-policy. These two terms are used to describe how, in a
general sense, the agent learns and behaves during the training phase, as well as the two
main ways that an agent can go about learning/behaving. Table 2 shows categories of the
different RL algorithms.
4. Reward Engineering
4.1. Imitation Learning
In imitation learning, the goal is to learn a policy that can mimic the behavior of an
expert. The expert’s behavior is represented as a set of demonstrations, which can be used
to train the policy. The policy is typically learned by minimizing the distance between
the expert’s behavior and the policy’s behavior, using a distance measure such as the KL
divergence or the maximum mean discrepancy. The advantage of imitation learning is that
it does not require a reward function to be specified, which can be difficult or infeasible in
some cases. However, it may not generalize well to situations that are different from those
encountered by the expert. The classification of imitation learning can be seen in Figure 8.
Behavior cloning learns a policy directly from expert demonstrations via supervised learning. An important example of behavior cloning is ALVINN [33], a vehicle equipped
with sensors that has learned to map the sensor inputs into steering angles and drive
autonomously. Dean Pomerleau initiated this research in 1989, and it was also the first
implementation of imitation learning in general.
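A minimal behavior-cloning sketch in this spirit is shown below: the policy is fitted to expert state-action pairs by supervised regression, with randomly generated tensors standing in for a real demonstration dataset; dimensions and the network are illustrative.

```python
# Behavior-cloning sketch: the policy is fitted to expert state-action pairs by
# supervised regression. Randomly generated tensors stand in for a real
# demonstration dataset; dimensions and the network are illustrative.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

expert_obs = torch.randn(256, 16)     # placeholder expert observations
expert_act = torch.randn(256, 4)      # placeholder expert (continuous) actions

for _ in range(100):
    loss = nn.functional.mse_loss(policy(expert_obs), expert_act)  # distance to expert behavior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```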
By learning from a non-parametric data distribution, goal-conditioned imitation learning (GCIL) can improve the efficiency and effectiveness of imitation learning.
5. Network Architecture
Neural networks are function approximators that are particularly effective for deep
reinforcement learning when the state space or action space is too large to be fully known.
One type of neural network commonly used in deep reinforcement learning is the multi-layer perceptron (MLP), which consists of an input layer, one or more hidden layers, and an output layer. Each node in the hidden and output layers is a neuron with a nonlinear activation function, except for the input nodes. MLPs are trained using the supervised learning technique of backpropagation. One drawback of MLPs is that they are fully connected, meaning that each neuron is connected to every neuron in the adjacent layers. This can lead to a large number of parameters and redundant information in high dimensions, making them inefficient.
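For illustration, a small MLP of this kind can be assembled as follows; the layer sizes are arbitrary and would normally match the task's observation and action dimensions.

```python
# A small MLP usable as a Q-network or policy head; layer sizes are arbitrary
# and would normally match the task's observation and action dimensions.
import torch.nn as nn

def build_mlp(obs_dim=32, hidden=(256, 256), out_dim=8):
    layers, in_dim = [], obs_dim
    for width in hidden:
        layers += [nn.Linear(in_dim, width), nn.ReLU()]   # hidden layer + nonlinearity
        in_dim = width
    layers.append(nn.Linear(in_dim, out_dim))             # e.g., Q-values or action logits
    return nn.Sequential(*layers)
```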
Convolutional neural networks (CNNs) operate on grid-structured input data, such as an image, and can learn to recognize patterns and features within the data through
the use of convolutional layers. Convolutional layers apply a set of filters to the input data,
which extract important features and compress the information. These features are then
passed through pooling layers, which reduce the dimensionality of the data and make the
network more robust to small changes in the input. The output of the CNN can then be
used as input for a value or policy function in a reinforcement learning algorithm.
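The sketch below shows such a CNN encoder producing a feature vector for a downstream value or policy head; the filter sizes follow common DQN-style choices and are illustrative rather than prescriptive.

```python
# A compact CNN encoder that turns image observations into a feature vector for
# a downstream value or policy head. Filter sizes follow common DQN-style
# choices and are illustrative rather than prescriptive.
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, in_channels=3, feature_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.LazyLinear(feature_dim)       # infers the flattened size on first use

    def forward(self, x):                            # x: [batch, channels, height, width]
        return self.head(self.conv(x).flatten(start_dim=1))
```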
6.1. Sim-to-Real
Training a robot in simulation is often easier than training it in the real world, as RL
algorithms often require a large number of samples and exploration can be risky for both
the robot and the environment. However, simulators may not always accurately reflect
reality, making it challenging to transfer policies trained in simulation to the real world,
a process known as sim-to-real transfer. Table 4 lists research papers on sim-to-real implementation, together with the RL algorithms and learning techniques that they used. Peng et al. [54] proposed using dynamics randomization to train recurrent policies in
simulation and deploy them directly on a physical robot, achieving good performance on an
object-pushing task without calibration. However, this approach does not consider visual
observations. OpenAI [55] has suggested using proximal policy optimization (PPO) with
LSTMs to learn the dexterity of in-hand manipulation for a physical Shadow Dexterous
Hand, using the learned policy to perform object reorientation and transferring the policy
directly to the physical hand. This work shares code with OpenAI Five, a system used to
play the game Dota 2.
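A minimal sketch of dynamics randomization in this spirit is shown below: physical parameters are resampled at the start of every training episode so the learned policy must be robust to all of them; the parameter names, ranges, and simulator interface are assumptions for illustration, not a specific simulator API.

```python
# Sketch of dynamics randomization: physical parameters are resampled from broad
# ranges at the start of every episode so the learned policy must cover all of
# them. The parameter names, ranges, and simulator interface (set_parameters /
# reset / step) are assumptions for illustration, not a specific simulator API.
import random

def sample_dynamics():
    return {
        "object_mass": random.uniform(0.05, 0.5),      # kg
        "table_friction": random.uniform(0.3, 1.2),
        "joint_damping": random.uniform(0.01, 0.2),
        "action_delay_steps": random.randint(0, 3),
    }

def train_episode(sim, policy):
    sim.set_parameters(sample_dynamics())              # assumed simulator hook
    obs, done = sim.reset(), False
    while not done:
        obs, reward, done = sim.step(policy(obs))      # assumed 3-tuple return
```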
Table 4. A list of papers about sim-to-real implementation relative to RL Algorithms and learn-
ing techniques.
Rusu et al. [68] proposed using progressive networks [91] to address the sim-to-real
transfer problem. Progressive networks incorporate lateral connections to link the layers
of previously trained network columns with each new column. This approach facilitates
transfer learning, domain adaptation, and compositionality. In such networks, columns
can have different capacities and structures, allowing the column intended for simulation
to have greater capacity than the one designed for reality. The latter can be initialized from
the former, enabling exploration and rapid learning from limited real data. Rusu et al. [68]
utilized the MuJoCo physics simulator to train the first column for a reaching task with
a simulated Jaco robot. They then used RGB images from a real Jaco robot to train the
second column. To handle dynamic tasks, such as dynamic targets, the authors suggested
incorporating a third column that utilizes proprioception features, including joint angles
and velocities for the arms and fingers.
Sadeghi et al. [92] introduced a convolutional recurrent neural network for teaching
a robot arm to perceive and manipulate desired objects from different viewpoints. The
method utilizes previous movements as a means of selecting actions for reaching the target,
rather than assuming known dynamics or requiring calibration. To learn the controller,
the authors proposed using simulated demonstration trajectories in conjunction with
reinforcement learning. While supervised demonstration learning often yields a policy that
focuses solely on achieving the goal in the short term, RL aids in the development of a
more long-term policy by assessing the action-value function to determine if the goal is
attainable. Additionally, the visual layers are fine-tuned using a limited number of realistic
images, which enhances the transfer performance.
Gu et al. [93] showed that a deep reinforcement learning algorithm based on off-policy
training of deep Q-functions can be scaled to handle complex 3D manipulation tasks. Moreover,
this approach is capable of effectively training deep neural network policies for use on physical
robots. The authors utilized a normalized advantage function algorithm that enabled them
to achieve training times appropriate for real-world robotic systems. Sun et al. [94] presented
reinforcement learning for mobile manipulation (ReLMM), a system for autonomously learning
mobile manipulation skills in the real world with minimal human intervention and without
instrumentation. The authors applied stationary and autonomous curricula for grasping policy.
Ding et al. [95] showed that using feedback from tactile sensor arrays located at the gripper for
learning and controlling can improve grasping stability and significantly enhance the perfor-
mance of robotic manipulation tasks, particularly for door-opening. This was demonstrated in
both simulation and reality settings.
movement primitives in a hybrid trajectory and force learning framework. The method
has been demonstrated to be more generalizable in both simulation and actual hardware
environments, safer, and more sample-efficient.
For the purpose of downstream policy learning in robotic manipulation tasks, Von
Hartz et al. [107] describe a technique for learning visual keypoints via dense correspon-
dence. The method uses raw camera observations to learn picture keypoints, making
policy learning more effective while addressing the issue of computationally expensive and
data-intensive policy learning. The method’s adaptability and efficiency are demonstrated
through evaluations of a variety of manipulation tasks and comparisons to other visual
representation learning methodologies.
to develop robotic abilities through behavior cloning (BC). The work demonstrates that
when learning robotic abilities from fixed datasets, implicit models can perform as well as or
better than explicit methods. Their study also shows that the suggested strategy works well
for tasks requiring high-dimensional manipulation and movement. The article provides a
unified framework by demonstrating the tight connections between the suggested implicit
technique and other well-known RL via supervised learning methods.
Shridhar et al. [113] show that transformers may be used in robotic manipulation tasks, which frequently need costly and limited data. The authors suggest PerAct, a language-conditioned behavior-cloning agent that outputs discretized actions after using a perceiver transformer to encode language goals and RGB-D voxel observations. PerAct offers a powerful structural prior for quickly learning 6-DoF actions since it operates on voxelized 3D observations and actions. With just a few demonstrations per task, the authors show how PerAct can teach a single multi-task transformer to perform 18 RLBench tasks and seven real-world tasks. According to the results, PerAct performs better than 3D ConvNet baselines and other image-to-action agents for a variety of tabletop activities.
When learning from expert datasets, Wang et al. [114] use conventional behavioral
cloning to outperform cutting-edge offline RL algorithms. By filtering the expert data and
performing BC on the subset, a semi-supervised classification strategy is employed to en-
hance the outcomes while learning from mixed datasets. To make use of the environment’s
geometric symmetry, simple data augmentation is also performed. The BC policies that
were submitted outperformed the mean return of the corresponding raw datasets, and the
policies that were trained on the filtered mixed datasets almost equaled the results of the
policies that were trained on the expert datasets.
6.2.3. Hierarchical RL
Finn et al. [115] investigated inverse reinforcement learning, or inverse optimal control,
for control applications. The authors suggested employing nonlinear cost functions, such
as neural networks, to introduce structure to the cost via informative features and effective
regularization. Additionally, the authors proposed approximating MaxEnt, as proposed by
Ziebart and colleagues [116], using samples for learning in high-dimensional continuous
environments where the dynamics are unknown.
Li et al. [117] suggest utilizing reinforcement learning from demonstrations and temporal logic (TL) to handle the problem of learning tasks with complicated temporal structures and long horizons. Based on the TL task specification, the technique creates intrinsic rewards, producing a policy with an understandable and hierarchical structure. The method is tested on a variety of robotic manipulation tasks, showing that it can handle challenging situations and outperform baselines.
For the autonomous learning and enhancement of numerous gripping methods,
Osa et al. [118] suggest a hierarchical policy search strategy. The technique makes use
of human examples to establish the grasping strategies before automatically compiling a
database of grasping actions and object point clouds to describe the grasp location and
policy as a bandit problem. The framework uses reinforcement learning to grasp both rigid and deformable items and exhibits high accuracy when grasping previously unseen objects. The suggested method addresses the challenge of building a comprehensive training dataset and enables a robotic system to learn and improve its grasping technique on its own.
To tackle problems with sparse rewards, Zhang et al. [119] suggest HIDIO, a hierarchical reinforcement learning system that can learn options independently of the task. HIDIO promotes option learning at a lower level that is independent of the job at hand, in contrast to other systems that hand-design or pre-define low-level tasks or skills. These options are discovered using an intrinsic entropy-minimization objective that is conditioned on the sub-trajectories of the options. In studies on robotic manipulation and navigation tasks, HIDIO outperforms two state-of-the-art hierarchical RL approaches and ordinary RL baselines in terms of success rates and sampling efficiency.
A keypoint extraction network and a visual servoing network, trained end-to-end under self-supervision, make up KOVIS. The suggested technique demonstrates its efficacy in both
simulated and real-world situations by achieving zero-shot sim-to-real transfer to robotic
manipulation tasks such as grabbing, peg-in-hole insertion with 4 mm clearance, and M13
screw insertion.
Using inputs from a single camera, Yuan et al. [127] provide a sim-to-real learning
approach for vision-based assembly jobs that addresses safety constraints and sample
efficiency problems in training robots in the real world. The system contains a force control
transfer mechanism to close the reality gap as well as a domain adaptation technique
based on cycle-consistent generative adversarial networks (CycleGAN). The suggested
framework may be effectively applied to a genuine peg-in-hole arrangement after being
taught in a simulated environment.
A new technique for predicting domain randomization distributions for secure sim-to-
real transfer of reinforcement learning policies in robotic manipulation is introduced by
Tiboni et al. [128] as DROPO. Unlike earlier research, DROPO only needs a small, offline
dataset of trajectory data that has been acquired in advance, and it directly represents
parameter uncertainty using a likelihood-based method. The article shows how DROPO
can recover dynamic parameter distributions during simulation and identify a distribution
that can account for unmodeled phenomena. Two zero-shot sim-to-real transfer scenarios
were used to test the method, which demonstrates effective domain transfer and enhanced
performance over earlier approaches.
The Kalman Randomized-to-Canonical Model, a zero-shot sim-to-real transferable vi-
sual model predictive control (MPC) technique, is presented by Yamanokuchi et al. [129].
The suggested system utilizes the KRC model to extract intrinsic characteristics and dynamics
that are task-relevant from randomized pictures. Via a block-mating task in simulation and a
valve-rotation task in both the real world and simulation, the effectiveness of the technique is
assessed. The findings demonstrate that KRC-MPC may be utilized in a zero-shot way across
a range of real domains and activities without the need for any actual data collection.
For robot manipulators, there are multiple reward engineering strategies available. Figure 9
shows the trend for reward engineering used in robotic manipulation from 2015 to 2022.
Figure 9. The trend of published papers using different reward engineering methods (imitation learning, behavior cloning, hierarchical RL, GAIL, curriculum learning, and transfer learning) in robotic manipulation, 2015–2022.
6.3. RL Techniques
6.3.1. Q-Learning
The difficulty of modifying robot learning systems to new surroundings, objects, and
perceptions is covered by Julian et al. [130]. Their paper offers a framework for fine-tuning
previously acquired behaviors through Q-learning, which promotes continual adaptation.
Their method has been demonstrated to be successful in a scenario of continuous learning
and results in significant performance increases with little data. Experiments on simulated
manipulation tasks and a genuine robotic grasping system that has been pre-trained on
580,000 grasps provide support for the findings.
Another line of work constructed custom environments in PyBullet and explored three different learning-style methods using wrapped PPO and DQN algorithms from OpenAI Baselines. These methods were successfully applied to solve diverse missions in the self-constructed environments.
By storing recovery actions in a separate safety buffer and utilizing k-means clustering to
choose the optimum recovery action when coming across comparable situations, Hsu et al. [136]
provide a method for assuring safety in deep reinforcement learning (RL). Six robotic control
tasks including navigation and manipulation are used to assess the proposed safety-aware RL
algorithm. The results demonstrate that it outperforms numerous baselines in both discrete and
continuous control problems, increasing safety throughout both the training and testing phases.
Iriondo et al. [137] provide a technique for picking up items with a mobile manipulator in unstructured areas, where conventional path planning and control become difficult. In this work, a controller for the robot base is trained using a deep reinforcement learning (DRL) technique. This controller directs the platform to a location from which the arm can plan a path to the item. The effectiveness of the approach is assessed by comparing two DRL algorithms (DDPG and PPO) on a specific robotic task.
Figure 10. The trend of published papers using different RL algorithms (Q-learning, DQN, PPO, TRPO, DDPG, TD3, and SAC) in robotic manipulation, 2015–2022.
6.3.8. GNN
Graph neural networks (GNNs) can be used as the network architecture for reinforcement
learning problems in relational environments. Table 5 features a list of papers about GNN
implementation relative to RL algorithms and learning techniques. Janisch et al. [151] proposed
a deep RL framework based on GNNs and auto-regressive policy decomposition that is well-
suited to these types of problems. They demonstrate that their approach, which uses GNNs,
can solve problems and generalize to different sizes without any prior knowledge.
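For illustration, the sketch below shows a single message-passing layer over an object-relation graph of the kind such a policy could use as its state encoder; the dense adjacency representation and feature dimensions are assumptions.

```python
# A single message-passing layer over an object-relation graph, of the kind a
# GNN-based policy could use as its state encoder. The dense adjacency matrix
# and feature dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.message = nn.Linear(dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, node_feats, adjacency):
        # node_feats: [N, dim]; adjacency: [N, N] with 1 where an edge exists.
        messages = adjacency @ self.message(node_feats)        # aggregate neighbor messages
        return torch.relu(self.update(torch.cat([node_feats, messages], dim=-1)))
```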
Table 5. A list of papers about GNN implementation relative to RL algorithms and learning techniques.
Yunfei Li et al. [125] study a challenging assembly task in which a robot arm must
build a feasible bridge based on a given design. The authors divide the problem into two
steps. First, they use an attention-based neural network as a “blueprint policy” to assemble
the bridge in a simulation environment. Then, they implement a motion-planning-based
policy for real-robot motion control.
Lin et al. [153] present a method for robot manipulation using graph neural networks
(GNNs). The authors propose a GNN-based approach for solving the object grasping
problem, in which the robot must select a suitable grasping pose for a given object. The
approach combines a graph representation of the object with a GNN to learn a grasp quality
function, which predicts the success of a grasp based on the object’s geometry and the
robot’s kinematics. The authors evaluate their method on a dataset of synthetic objects and
a real-world grasping task and demonstrate that it outperforms previous approaches in
terms of both efficiency and interpretability.
According to the review above, it appears that deep reinforcement learning has been
effectively used for a range of robotic manipulation tasks, such as grasping, pushing, and
door-opening. For these tasks, a variety of strategies have been put forth to increase the
sample efficiency of RL, including the use of demonstrations, supplemental incentives, and
experience replay. It has also been demonstrated that other techniques, such as model-free
approaches, residual learning, and asymmetric self-play, are useful for robotic manipulation.
Overall, it seems that there is still a lot of opportunity for future study in this field, especially in terms of learning complicated manipulation tasks with sparse rewards more effectively and in terms of transferring policies learned in simulation to the real world.
7. Conclusions
In this survey, we have provided an overview of deep reinforcement learning algo-
rithms for robotic manipulation. We have discussed the various approaches that have been
taken to address the challenges of learning manipulation tasks, including sim-to-real, re-
ward engineering, value-based, and policy-based approaches. We have also highlighted the
key challenges that remain in this field, including improving sample efficiency, developing
transfer learning capabilities, achieving real-time control, enabling safe exploration, and
integrating with other learning paradigms. One of the key strengths of the survey is its
comprehensive coverage of the current state of the art in DRL, exploring a wide range of techniques and applications for robotic manipulation. The survey does not, however,
go into great depth on all strategies due to space considerations. Nevertheless, the survey
can greatly benefit researchers and practitioners in the field of robotics and reinforcement
learning by providing insights into the advantages and limitations of various algorithms
and guiding the development of new approaches. Overall, this survey serves as a valuable
resource for understanding the current landscape of deep reinforcement learning in robotic
manipulation and can inspire further research to advance the field.
Future research in this field should concentrate more on overcoming these difficulties
and figuring out how to improve the performance and efficiency of deep reinforcement
learning algorithms for robotic manipulation tasks. By achieving this, we can get one step
closer to developing intelligent robots that can adjust to different surroundings and learn
novel abilities on their own. It is likely that as the area of RL for robotic manipulations
develops, we will witness the creation of increasingly more sophisticated algorithms and
methodologies that will allow robots to perform increasingly difficult manipulation tasks.
Author Contributions: Conceptualization, D.H. and S.C.; methodology, D.H. and S.C.; investigation,
D.H.; validation, D.H. and B.M.; data curation, D.H. and B.M.; writing—original draft preparation,
D.H.; writing—review and editing, V.S. and S.C.; supervision, S.C. All authors have read and agreed
to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Lasi, H.; Fettke, P.; Kemper, H.G.; Feld, T.; Hoffmann, M. Industry 4.0. Bus. Inf. Syst. Eng. 2014, 6, 239–242. [CrossRef]
2. Sigov, A.; Ratkin, L.; Ivanov, L.A.; Xu, L.D. Emerging enabling technologies for Industry 4.0 and beyond. Inf. Syst. Front. 2022, 1–11.
[CrossRef]
3. Hua, J.; Zeng, L.; Li, G.; Ju, Z. Learning for a robot: Deep reinforcement learning, imitation learning, transfer learning. Sensors
2021, 21, 1278. [CrossRef] [PubMed]
4. Mason, M.T. Toward Robotic Manipulation. Annu. Rev. Control. Robot. Auton. Syst. 2018, 1, 1–28. [CrossRef]
5. Hafiz, A.; Hassaballah, M.A.H. Reinforcement Learning with an Ensemble of Binary Action Deep Q-Networks. Comput. Syst. Sci. Eng.
2023, 46, 2651–2666. [CrossRef]
6. Hafiz, A.M.; Hassaballah, M.; Binbusayyis, A. Formula-Driven Supervised Learning in Computer Vision: A Literature Survey.
Appl. Sci. 2023, 13, 723. [CrossRef]
7. Morales, E.F.; Murrieta-Cid, R.; Becerra, I.; Esquivel-Basaldua, M.A. A survey on deep learning and deep reinforcement learning
in robotics with a tutorial on deep reinforcement learning. Intell. Serv. Robot. 2021, 14, 773–805. [CrossRef]
8. Rubagotti, M.; Sangiovanni, B.; Nurbayeva, A.; Incremona, G.P.; Ferrara, A.; Shintemirov, A. Shared Control of Robot Manipulators
With Obstacle Avoidance: A Deep Reinforcement Learning Approach. IEEE Control. Syst. Mag. 2023, 43, 44–63. [CrossRef]
9. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam,
V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–503. [CrossRef]
10. Zejnullahu, F.; Moser, M.; Osterrieder, J. Applications of Reinforcement Learning in Finance—Trading with a Double Deep
Q-Network. arXiv 2022, arXiv:2206.14267.
11. Ramamurthy, R.; Ammanabrolu, P.; Brantley, K.; Hessel, J.; Sifa, R.; Bauckhage, C.; Hajishirzi, H.; Choi, Y. Is Reinforcement
Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy
Optimization. arXiv 2023, arXiv:2210.01241.
12. Popova, M.; Isayev, O.; Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 2018, 4, eaap7885. [CrossRef]
[PubMed]
13. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Sallab, A.A.A.; Yogamani, S.; Pérez, P. Deep Reinforcement Learning for
Autonomous Driving: A Survey. arXiv 2021, arXiv:2002.00444.
14. Elguea-Aguinaco, Í.; Serrano-Muñoz, A.; Chrysostomou, D.; Inziarte-Hidalgo, I.; Bøgh, S.; Arana-Arexolaleiba, N. A review on
reinforcement learning for contact-rich robotic manipulation tasks. Robot. Comput.-Integr. Manuf. 2023, 81, 102517. [CrossRef]
15. Wang, F.Y.; Zhang, J.J.; Zheng, X.; Wang, X.; Yuan, Y.; Dai, X.; Zhang, J.; Yang, L. Where does AlphaGo go: From church-turing
thesis to AlphaGo thesis and beyond. IEEE/CAA J. Autom. Sin. 2016, 3, 113–120.
16. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [CrossRef]
17. Rummery, G.A.; Niranjan, M. On-Line Q-Learning Using Connectionist Systems; University of Cambridge, Department of
Engineering: Cambridge, UK, 1994; Volume 37.
18. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep
reinforcement learning. arXiv 2013, arXiv:1312.5602.
19. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference
on Artificial Intelligence, the Phoenix Convention Center, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
20. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement
learning. In Proceedings of the International Conference on Machine Learning, New York City, NY, USA, 19–24 June 2016;
pp. 1995–2003.
21. Sutton, R.S.; McAllester, D.A.; Singh, S.P.; Mansour, Y. Policy gradient methods for reinforcement learning with function approx-
imation. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 28–30 November 2000;
pp. 1057–1063.
22. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International
Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1889–1897.
23. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
24. Konda, V.R.; Tsitsiklis, J.N. Actor-critic algorithms. In Proceedings of the Advances in Neural Information Processing Systems,
Denver, CO, USA, 28–30 November 2000; pp. 1008–1014.
25. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for
deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York City, NY, USA,
19–24 June 2016; pp. 1928–1937.
26. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep
reinforcement learning. arXiv 2015, arXiv:1509.02971.
27. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.;
Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [CrossRef]
28. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic Policy Gradient Algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014.
29. Uhlenbeck, G.E.; Ornstein, L.S. On the theory of the Brownian motion. Phys. Rev. 1930, 36, 823. [CrossRef]
30. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the
International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596.
31. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic
algorithms and applications. arXiv 2018, arXiv:1812.05905.
32. Haarnoja, T.; Tang, H.; Abbeel, P.; Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the
International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1352–1361.
33. Pomerleau, D.A. Alvinn: An Autonomous Land Vehicle in a Neural Network. In Artificial Intelligence and Psychology; Technical
Report; Carnegie-Mellon University: Pittsburgh, PA, USA, 1989.
34. Christiano, P.; Leike, J.; Brown, T.B.; Martic, M.; Legg, S.; Amodei, D. Deep reinforcement learning from human preferences.
arXiv 2017, arXiv:1706.03741.
35. Ng, A.Y.; Russell, S.J. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, Stanford, CA, USA, 29 June–2 July 2000; Volume 1, p. 2.
36. Ziebart, B.D.; Maas, A.L.; Bagnell, J.A.; Dey, A.K. Maximum entropy inverse reinforcement learning. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, Chicago, IL, USA, 13–17 July 2008; Volume 8, pp. 1433–1438.
37. Ramachandran, D.; Amir, E. Bayesian Inverse Reinforcement Learning. In Proceedings of the International Joint Conference on
Artificial Intelligence, Hyderabad, India, 6–12 January 2007; Volume 7, pp. 2586–2591.
38. Ho, J.; Ermon, S. Generative adversarial imitation learning. Adv. Neural Inf. Process. Syst. 2016, 29, 4565–4573.
39. Ding, Y.; Florensa, C.; Phielipp, M.; Abbeel, P. Goal-conditioned imitation learning. arXiv 2019, arXiv:1906.05838.
40. Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, P.; Zaremba, W.
Hindsight experience replay. arXiv 2017, arXiv:1707.01495.
41. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International
Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 41–48.
42. Matiisen, T.; Oliver, A.; Cohen, T.; Schulman, J. Teacher–Student curriculum learning. IEEE Trans. Neural Netw. Learn. Syst. 2019,
31, 3732–3740. [CrossRef] [PubMed]
43. Sukhbaatar, S.; Lin, Z.; Kostrikov, I.; Synnaeve, G.; Szlam, A.; Fergus, R. Intrinsic motivation and automatic curricula via
asymmetric self-play. arXiv 2017, arXiv:1703.05407.
44. Sutton, R.S.; Precup, D.; Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement
learning. Artif. Intell. 1999, 112, 181–211. [CrossRef]
45. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst.
2012, 25, 1097–1105. [CrossRef]
46. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Internal Representations by Error Propagation; Technical Report; Institute for
Cognitive Science, California University of San Diego: La Jolla, CA, USA, 1985.
47. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef]
48. Sperduti, A.; Starita, A. Supervised neural networks for the classification of structures. IEEE Trans. Neural Netw. 1997, 8, 714–735.
[CrossRef]
49. Morris, C.; Ritzert, M.; Fey, M.; Hamilton, W.L.; Lenssen, J.E.; Rattan, G.; Grohe, M. Weisfeiler and leman go neural: Higher-order
graph neural networks. In Proceedings of the the Association for the Advancement of Artificial Intelligence (AAAI) Conference
on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 4602–4609.
50. Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the 31st International
Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 1025–1035.
51. Li, Y.; Tarlow, D.; Brockschmidt, M.; Zemel, R. Gated graph sequence neural networks. arXiv 2015, arXiv:1511.05493.
52. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
53. Kober, J.; Bagnell, J.A.; Peters, J. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 2013, 32, 1238–1274. [CrossRef]
54. Peng, X.B.; Andrychowicz, M.; Zaremba, W.; Abbeel, P. Sim-to-real transfer of robotic control with dynamics randomization. In
Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–26 May 2018;
pp. 3803–3810.
55. Andrychowicz, O.M.; Baker, B.; Chociej, M.; Jozefowicz, R.; McGrew, B.; Pachocki, J.; Petron, A.; Plappert, M.; Powell, G.; Ray, A.;
et al. Learning dexterous in-hand manipulation. Int. J. Robot. Res. 2020, 39, 3–20. [CrossRef]
56. Riedmiller, M.; Hafner, R.; Lampe, T.; Neunert, M.; Degrave, J.; Wiele, T.; Mnih, V.; Heess, N.; Springenberg, J.T. Learning
by playing solving sparse reward tasks from scratch. In Proceedings of the International Conference on Machine Learning,
Stockholm, Sweden, 10–15 July 2018; pp. 4344–4353.
57. Vecerik, M.; Hester, T.; Scholz, J.; Wang, F.; Pietquin, O.; Piot, B.; Heess, N.; Rothörl, T.; Lampe, T.; Riedmiller, M. Leveraging
demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv 2017, arXiv:1707.08817.
58. Kilinc, O.; Hu, Y.; Montana, G. Reinforcement learning for robotic manipulation using simulated locomotion demonstrations.
arXiv 2019, arXiv:1910.07294.
59. Chen, H. Robotic Manipulation with Reinforcement Learning, State Representation Learning, and Imitation Learning (Student
Abstract). In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial
Intelligence, Online, 2–9 February 2021; Volume 35, pp. 15769–15770.
60. Zhang, M.; Jian, P.; Wu, Y.; Xu, H.; Wang, X. DAIR: Disentangled Attention Intrinsic Regularization for Safe and Efficient
Bimanual Manipulation. arXiv 2021, arXiv:2106.05907.
61. Yamada, J.; Lee, Y.; Salhotra, G.; Pertsch, K.; Pflueger, M.; Sukhatme, G.S.; Lim, J.J.; Englert, P. Motion planner augmented
reinforcement learning for robot manipulation in obstructed environments. arXiv 2020, arXiv:2010.11940.
62. Yang, X.; Ji, Z.; Wu, J.; Lai, Y.K. An Open-Source Multi-Goal Reinforcement Learning Environment for Robotic Manipulation with
Pybullet. arXiv 2021, arXiv:2105.05985.
63. Vulin, N.; Christen, S.; Stevšić, S.; Hilliges, O. Improved learning of robot manipulation tasks via tactile intrinsic motivation.
IEEE Robot. Autom. Lett. 2021, 6, 2194–2201. [CrossRef]
64. Silver, T.; Allen, K.; Tenenbaum, J.; Kaelbling, L. Residual policy learning. arXiv 2018, arXiv:1812.06298.
65. Deisenroth, M.P.; Rasmussen, C.E.; Fox, D. Learning to control a low-cost manipulator using data-efficient reinforcement learning.
In Robotics: Science and Systems VII; MIT Press: Cambridge, MA, USA, 2011; Volume 7, pp. 57–64.
66. Li, R.; Jabri, A.; Darrell, T.; Agrawal, P. Towards practical multi-object manipulation using relational reinforcement learning. In Proceedings
of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Online, 31 May–31 August 2020; pp. 4051–4058.
67. Popov, I.; Heess, N.; Lillicrap, T.; Hafner, R.; Barth-Maron, G.; Vecerik, M.; Lampe, T.; Tassa, Y.; Erez, T.; Riedmiller, M.
Data-efficient deep reinforcement learning for dexterous manipulation. arXiv 2017, arXiv:1704.03073.
68. Rusu, A.A.; Večerík, M.; Rothörl, T.; Heess, N.; Pascanu, R.; Hadsell, R. Sim-to-real robot learning from pixels with progressive
nets. In Proceedings of the Conference on Robot Learning, Mountain View, CA, USA, 13–15 November 2017; pp. 262–270.
69. Nair, A.; McGrew, B.; Andrychowicz, M.; Zaremba, W.; Abbeel, P. Overcoming exploration in reinforcement learning with
demonstrations. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane,
Australia, 21–26 May 2018; pp. 6292–6299.
70. OpenAI; Plappert, M.; Sampedro, R.; Xu, T.; Akkaya, I.; Kosaraju, V.; Welinder, P.; D’Sa, R.; Petron, A.; Pinto, H.P.d.O.; et al.
Asymmetric self-play for automatic goal discovery in robotic manipulation. arXiv 2021, arXiv:2101.04882.
71. Zhan, A.; Zhao, P.; Pinto, L.; Abbeel, P.; Laskin, M. A Framework for Efficient Robotic Manipulation. arXiv 2020, arXiv:2012.07975.
72. Franceschetti, A.; Tosello, E.; Castaman, N.; Ghidoni, S. Robotic arm control and task training through deep reinforcement
learning. In Intelligent Autonomous Systems 16, Proceedings of the 16th International Conference IAS-16, Singapore, 29–31 July 2020;
Springer: Berlin/Heidelberg, Germany, 2022; pp. 532–550.
73. Lu, L.; Zhang, M.; He, D.; Gu, Q.; Gong, D.; Fu, L. A Method of Robot Grasping Based on Reinforcement Learning. J. Phys. Conf. Ser.
2022, 2216, 012026. [CrossRef]
74. Davchev, T.; Luck, K.S.; Burke, M.; Meier, F.; Schaal, S.; Ramamoorthy, S. Residual learning from demonstration: Adapting dmps
for contact-rich manipulation. IEEE Robot. Autom. Lett. 2022, 7, 4488–4495. [CrossRef]
75. Zhang, X.; Jin, S.; Wang, C.; Zhu, X.; Tomizuka, M. Learning insertion primitives with discrete-continuous hybrid action space for
robotic assembly tasks. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia,
PA, USA, 23–27 May 2022; pp. 9881–9887.
76. Zhao, T.Z.; Luo, J.; Sushkov, O.; Pevceviciute, R.; Heess, N.; Scholz, J.; Schaal, S.; Levine, S. Offline meta-reinforcement learning for
industrial insertion. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA,
23–27 May 2022; pp. 6386–6393.
77. Ding, Y.; Zhao, J.; Min, X. Impedance control and parameter optimization of surface polishing robot based on reinforcement
learning. Proc. Inst. Mech. Eng. Part B J. Eng. Manuf. 2023, 237, 216–228. [CrossRef]
78. Belousov, B.; Wibranek, B.; Schneider, J.; Schneider, T.; Chalvatzaki, G.; Peters, J.; Tessmann, O. Robotic architectural assembly
with tactile skills: Simulation and optimization. Autom. Constr. 2022, 133, 104006. [CrossRef]
79. Lin, N.; Li, Y.; Tang, K.; Zhu, Y.; Zhang, X.; Wang, R.; Ji, J.; Chen, X.; Zhang, X. Manipulation planning from demonstration via
goal-conditioned prior action primitive decomposition and alignment. IEEE Robot. Autom. Lett. 2022, 7, 1387–1394. [CrossRef]
80. Cong, L.; Liang, H.; Ruppel, P.; Shi, Y.; Görner, M.; Hendrich, N.; Zhang, J. Reinforcement learning with vision-proprioception
model for robot planar pushing. Front. Neurorobot. 2022, 16, 829437. [CrossRef]
81. Kim, S.; Jo, H.; Song, J.B. Object manipulation system based on image-based reinforcement learning. Intell. Serv. Robot. 2022, 15, 171–177.
[CrossRef]
82. Nasiriany, S.; Liu, H.; Zhu, Y. Augmenting reinforcement learning with behavior primitives for diverse manipulation tasks. In
Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022;
pp. 7477–7484.
83. Anand, A.S.; Myrestrand, M.H.; Gravdahl, J.T. Evaluation of variable impedance- and hybrid force/motion controllers for learning force tracking skills. In Proceedings of the 2022 IEEE/SICE International Symposium on System Integration (SII), Online, 9–12 January 2022; pp. 83–89.
84. Deisenroth, M.; Rasmussen, C.E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th
International Conference on Machine Learning (ICML-11), Bellevue, WA, USA, 28 June–2 July 2011; pp. 465–472.
85. Ben-Iwhiwhu, E.; Dick, J.; Ketz, N.A.; Pilly, P.K.; Soltoggio, A. Context meta-reinforcement learning via neuromodulation. Neural Netw.
2022, 152, 70–79. [CrossRef]
86. Brunke, L.; Greeff, M.; Hall, A.W.; Yuan, Z.; Zhou, S.; Panerati, J.; Schoellig, A.P. Safe learning in robotics: From learning-based
control to safe reinforcement learning. Annu. Rev. Control. Robot. Auton. Syst. 2022, 5, 411–444. [CrossRef]
87. Wabersich, K.P.; Zeilinger, M.N. Linear model predictive safety certification for learning-based control. In Proceedings of the
2018 IEEE Conference on Decision and Control (CDC), Miami Beach, FL, USA, 17–19 December 2018; pp. 7130–7135.
88. Beyene, S.W.; Han, J.H. Prioritized Hindsight with Dual Buffer for Meta-Reinforcement Learning. Electronics 2022, 11, 4192.
[CrossRef]
89. Shao, Q.; Qi, J.; Ma, J.; Fang, Y.; Wang, W.; Hu, J. Object detection-based one-shot imitation learning with an RGB-D camera. Appl. Sci.
2020, 10, 803. [CrossRef]
90. Ho, D.; Rao, K.; Xu, Z.; Jang, E.; Khansari, M.; Bai, Y. RetinaGAN: An object-aware approach to sim-to-real transfer. In
Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021;
pp. 10920–10926.
91. Rusu, A.A.; Rabinowitz, N.C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; Hadsell, R. Progressive
neural networks. arXiv 2016, arXiv:1606.04671.
92. Sadeghi, F.; Toshev, A.; Jang, E.; Levine, S. Sim2real view invariant visual servoing by recurrent control. arXiv 2017,
arXiv:1712.07642.
93. Gu, S.; Holly, E.; Lillicrap, T.; Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy
updates. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Marina Bay Sands,
Singapore, 29 May–3 June 2017; pp. 3389–3396.
94. Sun, C.; Orbik, J.; Devin, C.; Yang, B.; Gupta, A.; Berseth, G.; Levine, S. Fully Autonomous Real-World Reinforcement Learning
for Mobile Manipulation. arXiv 2021, arXiv:2107.13545.
95. Ding, Z.; Tsai, Y.Y.; Lee, W.W.; Huang, B. Sim-to-Real Transfer for Robotic Manipulation with Tactile Sensory. arXiv 2021,
arXiv:2103.00410.
96. Duan, Y.; Andrychowicz, M.; Stadie, B.C.; Ho, J.; Schneider, J.; Sutskever, I.; Abbeel, P.; Zaremba, W. One-shot imitation learning.
arXiv 2017, arXiv:1703.07326.
97. Finn, C.; Yu, T.; Zhang, T.; Abbeel, P.; Levine, S. One-shot visual imitation learning via meta-learning. In Proceedings of the
Conference on Robot Learning, Mountain View, CA, USA, 13–15 November 2017; pp. 357–368.
98. Yu, T.; Finn, C.; Xie, A.; Dasari, S.; Zhang, T.; Abbeel, P.; Levine, S. One-shot imitation from observing humans via domain-adaptive
meta-learning. arXiv 2018, arXiv:1802.01557.
99. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the
International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1126–1135.
100. Wang, Z.; Merel, J.; Reed, S.; Wayne, G.; de Freitas, N.; Heess, N. Robust imitation of diverse behaviors. arXiv 2017, arXiv:1707.02747.
101. Zhou, A.; Kim, M.J.; Wang, L.; Florence, P.; Finn, C. NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via
Novel-View Synthesis. arXiv 2023, arXiv:2301.08556.
102. Li, K.; Chappell, D.; Rojas, N. Immersive Demonstrations are the Key to Imitation Learning. arXiv 2023, arXiv:2301.09157.
103. Tong, D.; Choi, A.; Terzopoulos, D.; Joo, J.; Jawed, M.K. Deep Learning of Force Manifolds from the Simulated Physics of Robotic
Paper Folding. arXiv 2023, arXiv:2301.01968.
104. Zhang, D.; Fan, W.; Lloyd, J.; Yang, C.; Lepora, N.F. One-Shot Domain-Adaptive Imitation Learning via Progressive Learning
Applied to Robotic Pouring. arXiv 2022, arXiv:2204.11251.
105. Yi, J.B.; Kim, J.; Kang, T.; Song, D.; Park, J.; Yi, S.J. Anthropomorphic Grasping of Complex-Shaped Objects Using Imitation
Learning. Appl. Sci. 2022, 12, 12861. [CrossRef]
106. Wang, Y.; Beltran-Hernandez, C.C.; Wan, W.; Harada, K. An adaptive imitation learning framework for robotic complex
contact-rich insertion tasks. Front. Robot. AI 2022, 8, 414. [CrossRef]
107. von Hartz, J.O.; Chisari, E.; Welschehold, T.; Valada, A. Self-Supervised Learning of Multi-Object Keypoints for Robotic Manipulation.
arXiv 2022, arXiv:2205.08316.
108. Zhou, Y.; Aytar, Y.; Bousmalis, K. Manipulator-independent representations for visual imitation. arXiv 2021, arXiv:2103.09016.
109. Jung, E.; Kim, I. Hybrid imitation learning framework for robotic manipulation tasks. Sensors 2021, 21, 3409. [CrossRef] [PubMed]
110. Bong, J.H.; Jung, S.; Kim, J.; Park, S. Standing Balance Control of a Bipedal Robot Based on Behavior Cloning. Biomimetics 2022,
7, 232. [CrossRef]
111. Shafiullah, N.M.M.; Cui, Z.J.; Altanzaya, A.; Pinto, L. Behavior Transformers: Cloning k modes with one stone. arXiv 2022,
arXiv:2206.11251.
112. Piche, A.; Pardinas, R.; Vazquez, D.; Mordatch, I.; Pal, C. Implicit Offline Reinforcement Learning via Supervised Learning. arXiv
2022, arXiv:2210.12272.
113. Shridhar, M.; Manuelli, L.; Fox, D. Perceiver-actor: A multi-task transformer for robotic manipulation. arXiv 2022, arXiv:2209.05451.
114. Wang, Q.; McCarthy, R.; Bulens, D.C.; Redmond, S.J. Winning Solution of Real Robot Challenge III. arXiv 2023, arXiv:2301.13019.
115. Finn, C.; Levine, S.; Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the
International Conference on Machine Learning, New York City, NY, USA, 19–24 June 2016; pp. 49–58.
116. Zhao, X.; Xia, L.; Zhang, L.; Ding, Z.; Yin, D.; Tang, J. Deep reinforcement learning for page-wise recommendations. In Proceedings
of the 12th ACM Conference on Recommender Systems, Vancouver, BC, Canada, 2 October 2018; pp. 95–103.
117. Li, X.; Ma, Y.; Belta, C. Automata guided reinforcement learning with demonstrations. arXiv 2018, arXiv:1809.06305.
118. Osa, T.; Peters, J.; Neumann, G. Hierarchical reinforcement learning of multiple grasping strategies with human instructions.
Adv. Robot. 2018, 32, 955–968. [CrossRef]
119. Zhang, J.; Yu, H.; Xu, W. Hierarchical reinforcement learning by discovering intrinsic options. arXiv 2021, arXiv:2101.06521.
120. Baram, N.; Anschel, O.; Caspi, I.; Mannor, S. End-to-end differentiable adversarial imitation learning. In Proceedings of the
International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 390–399.
121. Merel, J.; Tassa, Y.; TB, D.; Srinivasan, S.; Lemmon, J.; Wang, Z.; Wayne, G.; Heess, N. Learning human behaviors from motion
capture by adversarial imitation. arXiv 2017, arXiv:1707.02201.
122. Tsurumine, Y.; Matsubara, T. Goal-aware generative adversarial imitation learning from imperfect demonstration for robotic
cloth manipulation. Robot. Auton. Syst. 2022, 158, 104264. [CrossRef]
123. Zolna, K.; Reed, S.; Novikov, A.; Colmenarejo, S.G.; Budden, D.; Cabi, S.; Denil, M.; de Freitas, N.; Wang, Z. Task-relevant
adversarial imitation learning. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021;
pp. 247–263.
124. Yang, X.; Ji, Z.; Wu, J.; Lai, Y.K. Abstract demonstrations and adaptive exploration for efficient and stable multi-step sparse
reward reinforcement learning. In Proceedings of the 2022 27th International Conference on Automation and Computing (ICAC),
Bristol, UK, 1–3 September 2022; pp. 1–6.
125. Li, Y.; Kong, T.; Li, L.; Li, Y.; Wu, Y. Learning to Design and Construct Bridge without Blueprint. arXiv 2021, arXiv:2108.02439.
126. Puang, E.Y.; Tee, K.P.; Jing, W. KOVIS: Keypoint-based visual servoing with zero-shot sim-to-real transfer for robotics manipulation. In
Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Online, 25 October–24 December
2020; pp. 7527–7533.
127. Yuan, C.; Shi, Y.; Feng, Q.; Chang, C.; Liu, M.; Chen, Z.; Knoll, A.C.; Zhang, J. Sim-to-Real Transfer of Robotic Assembly with
Visual Inputs Using CycleGAN and Force Control. In Proceedings of the 2022 IEEE International Conference on Robotics and
Biomimetics (ROBIO), Xishuangbanna, China, 5–9 December 2022; pp. 1426–1432.
128. Tiboni, G.; Arndt, K.; Kyrki, V. DROPO: Sim-to-Real Transfer with Offline Domain Randomization. arXiv 2022, arXiv:2201.08434.
129. Yamanokuchi, T.; Kwon, Y.; Tsurumine, Y.; Uchibe, E.; Morimoto, J.; Matsubara, T. Randomized-to-Canonical Model Predictive
Control for Real-World Visual Robotic Manipulation. IEEE Robot. Autom. Lett. 2022, 7, 8964–8971. [CrossRef]
130. Julian, R.; Swanson, B.; Sukhatme, G.S.; Levine, S.; Finn, C.; Hausman, K. Efficient adaptation for end-to-end vision-based robotic
manipulation. In Proceedings of the 4th Lifelong Machine Learning Workshop at ICML, Online, 13–18 July 2020.
131. Rammohan, S.; Yu, S.; He, B.; Hsiung, E.; Rosen, E.; Tellex, S.; Konidaris, G. Value-Based Reinforcement Learning for Continuous
Control Robotic Manipulation in Multi-Task Sparse Reward Settings. arXiv 2021, arXiv:2107.13356.
132. Wang, D.; Walters, R. SO(2)-equivariant reinforcement learning. In Proceedings of the International Conference on Learning
Representations, Online, 25–29 April 2022.
133. Deng, Y.; Guo, D.; Guo, X.; Zhang, N.; Liu, H.; Sun, F. MQA: Answering the question via robotic manipulation. arXiv 2020,
arXiv:2003.04641.
134. Imtiaz, M.B.; Qiao, Y.; Lee, B. Prehensile and Non-Prehensile Robotic Pick-and-Place of Objects in Clutter Using Deep Reinforcement Learning. Sensors 2023, 23, 1513. [CrossRef]
135. Sarantopoulos, I.; Kiatos, M.; Doulgeri, Z.; Malassiotis, S. Split deep Q-learning for robust object singulation. In Proceedings of
the 2020 IEEE International Conference on Robotics and Automation (ICRA), Online, 31 May–31 August 2020; pp. 6225–6231.
136. Hsu, H.L.; Huang, Q.; Ha, S. Improving safety in deep reinforcement learning using unsupervised action planning. In Proceedings
of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 5567–5573.
137. Iriondo, A.; Lazkano, E.; Susperregi, L.; Urain, J.; Fernandez, A.; Molina, J. Pick and place operations in logistics using a mobile
manipulator controlled with deep reinforcement learning. Appl. Sci. 2019, 9, 348. [CrossRef]
138. Hwangbo, J.; Lee, J.; Dosovitskiy, A.; Bellicoso, D.; Tsounis, V.; Koltun, V.; Hutter, M. Learning agile and dynamic motor skills for
legged robots. Sci. Robot. 2019, 4, eaau5872. [CrossRef] [PubMed]
139. Clegg, A.; Yu, W.; Tan, J.; Kemp, C.C.; Turk, G.; Liu, C.K. Learning human behaviors for robot-assisted dressing. arXiv 2017,
arXiv:1709.07033.
140. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952.
141. Wang, Q.; Sanchez, F.R.; McCarthy, R.; Bulens, D.C.; McGuinness, K.; O’Connor, N.; Wüthrich, M.; Widmaier, F.; Bauer, S.;
Redmond, S.J. Dexterous robotic manipulation using deep reinforcement learning and knowledge transfer for complex sparse
reward-based tasks. Expert Syst. 2022, e13205. [CrossRef]
142. Luu, T.M.; Yoo, C.D. Hindsight Goal Ranking on Replay Buffer for Sparse Reward Environment. IEEE Access 2021, 9, 51996–52007.
[CrossRef]
143. Eppe, M.; Magg, S.; Wermter, S. Curriculum goal masking for continuous deep reinforcement learning. In Proceedings of the 2019 Joint
IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), Oslo, Norway, 19–22 August 2019;
pp. 183–188.
144. Sehgal, A.; La, H.; Louis, S.; Nguyen, H. Deep reinforcement learning using genetic algorithm for parameter optimization. In Proceedings
of the 2019 Third IEEE International Conference on Robotic Computing (IRC), Naples, Italy, 25–27 February 2019; pp. 596–601.
145. Sehgal, A.; Ward, N.; La, H.; Louis, S. Automatic parameter optimization using genetic algorithm in deep reinforcement learning
for robotic manipulation tasks. arXiv 2022, arXiv:2204.03656.
146. Nair, A.V.; Pong, V.; Dalal, M.; Bahl, S.; Lin, S.; Levine, S. Visual reinforcement learning with imagined goals. arXiv 2018,
arXiv:1807.04742.
147. Meng, Z.; She, C.; Zhao, G.; De Martini, D. Sampling, communication, and prediction co-design for synchronizing the real-world
device and digital model in metaverse. IEEE J. Sel. Areas Commun. 2022, 41, 288–300. [CrossRef]
148. Li, H.; Zhou, X.H.; Xie, X.L.; Liu, S.Q.; Gui, M.J.; Xiang, T.Y.; Wang, J.L.; Hou, Z.G. Discrete soft actor-critic with auto-encoder on
vascular robotic system. Robotica 2022, 41, 1115–1126. [CrossRef]
149. Wang, D.; Jia, M.; Zhu, X.; Walters, R.; Platt, R. On-robot learning with equivariant models. In Proceedings of the Conference on
Robot Learning, Auckland, New Zealand, 14–18 December 2022.
150. Jian, P.; Yang, C.; Guo, D.; Liu, H.; Sun, F. Adversarial Skill Learning for Robust Manipulation. In Proceedings of the 2021 IEEE
International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 2555–2561.
151. Janisch, J.; Pevnỳ, T.; Lisỳ, V. Symbolic Relational Deep Reinforcement Learning based on Graph Neural Networks. arXiv 2020,
arXiv:2009.12462.
152. Almasan, P.; Suárez-Varela, J.; Badia-Sampera, A.; Rusek, K.; Barlet-Ros, P.; Cabellos-Aparicio, A. Deep reinforcement learning
meets graph neural networks: Exploring a routing optimization use case. arXiv 2019, arXiv:1910.07421.
153. Lin, Y.; Wang, A.S.; Undersander, E.; Rai, A. Efficient and interpretable robot manipulation with graph neural networks. arXiv
2021, arXiv:2102.13177.
154. Sieb, M.; Xian, Z.; Huang, A.; Kroemer, O.; Fragkiadaki, K. Graph-structured visual imitation. In Proceedings of the Conference
on Robot Learning, Online, 16–18 November 2020; pp. 979–989.
155. Xie, F.; Chowdhury, A.; De Paolis Kaluza, M.; Zhao, L.; Wong, L.; Yu, R. Deep imitation learning for bimanual robotic manipulation.
Adv. Neural Inf. Process. Syst. 2020, 33, 2327–2337.
156. Liang, J.; Boularias, A. Learning Category-Level Manipulation Tasks from Point Clouds with Dynamic Graph CNNs. arXiv 2022,
arXiv:2209.06331.
157. Oliva, M.; Banik, S.; Josifovski, J.; Knoll, A. Graph Neural Networks for Relational Inductive Bias in Vision-based Deep
Reinforcement Learning of Robot Control. In Proceedings of the 2022 International Joint Conference on Neural Networks
(IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–9.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.