


Robotics and Computer–Integrated Manufacturing 77 (2022) 102360

Contents lists available at ScienceDirect

Robotics and Computer-Integrated Manufacturing


journal homepage: www.elsevier.com/locate/rcim

Full length Article

Robot learning towards smart robotic manufacturing: A review


Zhihao Liu a,b,c, Quan Liu a,b, Wenjun Xu a,b,*, Lihui Wang c,*, Zude Zhou a

a School of Information Engineering, Wuhan University of Technology, Wuhan 430070, China
b Hubei Key Laboratory of Broadband Wireless Communication and Sensor Networks (Wuhan University of Technology), Wuhan 430070, China
c Department of Production Engineering, KTH Royal Institute of Technology, SE-114 28 Stockholm, Sweden

A R T I C L E  I N F O

Keywords: Robot learning; Smart manufacturing; Robotic manufacturing; Artificial intelligence

A B S T R A C T

Robotic equipment has been playing a central role since the proposal of smart manufacturing. Since the beginning of the first integration of industrial robots into production lines, industrial robots have enhanced productivity and relieved humans from heavy workloads significantly. Towards the next generation of manufacturing, this review first introduces the comprehensive background of smart robotic manufacturing within robotics, machine learning, and robot learning. Definitions and categories of robot learning are summarised. Concretely, imitation learning, policy gradient learning, value function learning, actor-critic learning, and model-based learning as the leading technologies in robot learning are reviewed. Training tools, benchmarks, and comparisons amongst different robot learning methods are delivered. Typical industrial applications in robotic grasping, assembly, process control, and industrial human-robot collaboration are listed and discussed. Finally, open problems and future research directions are summarised.

* Corresponding authors.
E-mail addresses: [email protected] (Z. Liu), [email protected] (Q. Liu), [email protected] (W. Xu), [email protected] (L. Wang), [email protected] (Z. Zhou).
https://doi.org/10.1016/j.rcim.2022.102360
Received 22 January 2022; Received in revised form 5 April 2022; Accepted 7 April 2022; Available online 20 April 2022
0736-5845/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Robotic equipment has been playing a central role since the proposal of smart manufacturing, showing potential to enhance factory automation [1]. Beginning with the first integration of industrial robots into production lines, robots like industrial manipulators, collaborative robots, automated guided vehicles (AGVs) and even unmanned aerial vehicles (UAVs) have been employed to strengthen manufacturing systems from one generation to another [2]. On the other hand, with the development of artificial intelligence (AI) such as deep learning and of computing devices such as the GPU (graphics processing unit) and NPU (neural processing unit), robots are expected to gain enhanced intelligence. This trend focuses on releasing robots from the clampdown of old-fashioned, hard-programmed scenarios and investigates opportunities where robots can qualify for more flexible and human-centred tasks [3]. Aimed at more complex and flexible environments, robots are supposed to master advanced skills by self-learning to achieve success, just as in machine learning [4]. Towards the next generation of smart robotic manufacturing, this review puts robot learning technology under the spotlight and summarises the background, methodology, typical research, industrial applications, as well as the open challenges and future research directions. The motivation of such a review is to give a concise introduction to smart robotic manufacturing, machine learning and robot learning, especially the latest development of robot learning methodologies. Industrial applications and future directions are also highlighted, delivering more inspiration for readers from the field of manufacturing research.

A bibliometric analysis was conducted on published papers indexed by the Scopus database. Concretely, we analysed published records under the keyword "robot learning" from 1950 to 2020 and "robot learning" together with "manufacturing" from 1980 to 2020, as shown in Fig. 1. Fig. 1(c) to (f) give detailed data on "robot learning" together with "manufacturing", which is precisely the topic of this survey. Fig. 1(a) shows the published papers from 1980 to 2020. From 1980 to 2015, papers increased slowly but witnessed a sharp increase after 2015. This phenomenon also occurred on the developing curve of robot learning research. Robot learning witnessed two sharp growths around 2008 and 2018, influenced by the proposals of deep learning and deep reinforcement learning around 2006 and 2015, respectively. From 2015 to 2020, the growth rate of robot learning research went up year by year. As for papers indexed by "robot learning" and "manufacturing", the top five countries are the United States, China, Germany, Japan and the United Kingdom. Most of the papers are published within computer science and engineering. Furthermore, Fig. 1(e) and (f) list the top 10
authors and affiliations in this area.

Fig. 1. Statistics from Scopus (search keywords: "robot learning" and "manufacturing"): (a) papers by year ("robot learning" and "manufacturing", 1980~2020); (b) papers by year ("robot learning", 1950~2020); (c) papers by country/territory: top 10; (d) papers by subject area; (e) papers by author: top 10; (f) papers by affiliation: top 10.

To show the distinction of our review, similar review papers on robot learning are listed in Table 1, including six review or survey papers from different venues. Papers 1 and 2 emphasise robot learning for robotic manipulation. Papers 3, 4 and 6 focus on learning from demonstration, also known as imitation learning. Specifically, paper 3 takes path planning as its background, while paper 6 is reviewed with an assembly orientation. Paper 5 stresses human-robot interaction in robotics and comfort in the collaboration. Apart from the review papers listed in Table 1, we underscore some different perspectives. Firstly, this review considers manufacturing research to be the main stage, which has not been highlighted in other reviews. Secondly, we deliver a detailed methodology without over-written mathematical contents of different categories of robot learning technology. Published applications in the area of manufacturing research are classified and analysed. What needs to be pointed out is that robot types that are rare in manufacturing research, such as humanoid robots, service robots, and robots for medical surgery and rehabilitation, are omitted in this review. Finally, we try to summarise the challenges and open problems of robot learning towards smart robotic manufacturing together with possible research directions for future work.

The rest of the content is organised as follows. Section 2 introduces the background and concepts of smart robotic manufacturing, machine learning and robot learning. Section 3 focuses on the definitions and categories. Section 4 introduces the methodologies and principles, while Section 5 discusses the training tools and benchmarks. Section 6
summarises the published industrial applications. Novel methods and applications published within ten years are emphasised in Sections 4 and 6. Last but not least, open problems and future research directions are delivered in Section 7.

Table 1. Published review papers on robot learning.
No. | Title | Venue | Year
1 | Toward next-generation learned robot manipulation [5] | Science Robotics | 2021
2 | A review of robot learning for manipulation: challenges, representations, and algorithms [6] | Journal of Machine Learning Research | 2021
3 | Robot learning from demonstration for path planning: A review [7] | Science China Technological Sciences | 2020
4 | Recent advances in robot learning from demonstration [8] | Annual Review of Control, Robotics, and Autonomous Systems | 2020
5 | Learning and comfort in human-robot interaction: A review [9] | Applied Sciences | 2019
6 | Robot learning from demonstration in robotic assembly: A survey [10] | Robotics | 2018

2. Background and concepts

Robot learning towards smart manufacturing is a multidisciplinary research topic that contains robotics, machine learning, optimization, and manufacturing science. This section introduces the background and concepts of smart robotic manufacturing, machine learning, and robot learning.

2.1. Smart robotic manufacturing

Robotics is the science of perceiving and acting in the physical world with computer-controlled mechanical devices [11]. Robotic devices vary from industrial robots, mobile robots, medical and surgery robots, rehabilitation robots, drones, service robots, and so forth. Amongst industrial robots, robotic manipulators are the most common devices. Industrial robots have been invented and utilised in factories based on the development of modern computers, integrated circuits, and robotics science [12]. After the first batch of industrial robots was employed in the new factory of General Motors in Ohio, the breakthrough era of industrial robots had come. Today, robots are widely used in manufacturing, and their types extend from industrial robotic manipulators from well-known vendors like ABB, KUKA, UR, FANUC and YASKAWA, to other kinds of robots, including AGVs and UAVs [13–15]. Fig. 2 shows various types of robots on the shop floor on its right side. Those robotic applications include carrying, painting, packaging, polishing, and emerging human-robot collaboration [16]. So far, robots have successfully freed human workers from repetitive and overloaded tasks in the factory and brought the manufacturing industry to a new generation of automation. However, they ultimately did not cross the original boundary of robotics, which is to manipulate the physical world under the control of computers. Challenges from dynamics, uncertainties, and flexibility arise.

Fig. 2. One exemplary application of smart robotic manufacturing.

The eagerness for enhanced robotics gives birth to smart robotic manufacturing, where robots are designed to handle more complex tasks with a higher degree of intelligence [17–20]. Physically, robots are driven by motors and further by the electric power applied to individual motors. Motors generate force and operate the mechanical body of the robot under the governance of robotic dynamics. With the rotation of the motors inside a robot, the end effector, a.k.a. tool central point (TCP), can be moved to a position with a gesture in the Cartesian space under the rules of robot kinematics. Finally, those motions can be designed for multiple manufacturing tasks with different tools on the TCP. Hence, there are three levels of abstraction of robot control, listed below.

• Level 1: Motor level. Motor control stresses how much force needs to be generated on one motor. Often, it is influenced by the power employed on a motor and then affects the objects in direct contact with the end effector. The rotation speed of motors also relies on power. In manufacturing, force control is common in grasping, levering or haptic human-robot collaboration [21], using torque sensors [22] or sensorless estimation [23]. In customised robotic
devices, it is available for end-users to program the power of motors by motor drivers and microcontrollers (e.g. by pulse-width modulation). However, the availability of motor power is usually blocked by commercial robot vendors, resulting in compromises with users. Collaborative robots like KUKA iiwa, ABB GoFa, and UR robots with full torque sensors bring more possibilities.

• Level 2: Motion level. Motion control, a.k.a. motion planning, takes a significant role in industrial robotic applications, from path planning of AGVs and UAVs to trajectory planning of industrial robots. Sequential postures represent those motions on a plane or in Cartesian space on the temporal dimension. The posture of a robot is embodied with position and gesture in the form of temporal values of joint angles. Control of the joint angles takes motor control as the foundation. Commercial industrial robots support programmable motion control as their key functionality, such as the ABB RAPID and KUKA KRL robotic programming languages. The motion level is an abstraction of the motor level because it focuses more on the motion, path or trajectories of the robot or of its TCP, while the motors govern the momentary movement.

• Level 3: Task level. Task control is the level wrapping motor and motion control into complete solutions towards industrial robotic tasks, for instance, packaging, transporting, cutting, welding, polishing, painting, etc. A task is made up of multiple procedures. Fig. 2 illustrates one procedure in a task. A robotic procedure in a task is generally equal to a plan of the robot's trajectory. One trajectory is a curve of sequential TCP postures in the Cartesian space with start and end points. Start and end points vary due to different procedures. By splicing trajectories sequentially, a robot can generate logical actions in the physical world. Nevertheless, trajectories are not the only element in one task. Triggering I/O signals, alerting on the user interfaces or writing into the database are also procedures of a task, especially when a comprehensive work cell is targeted. Overall, the task level takes the motor and motion levels as an abstraction on account of the governance of joint angles and torques over the trajectory.

Control on the motor, motion and task levels is easy to confuse. In fact, they are different subjects requiring various domains. Those domains depend on multiple disciplines, including but not limited to electro-mechanics, automatic control, production scheduling, robotic kinematics, robotic dynamics, planning algorithms and optimization. Because of business secrets and patents, motor control of commercial industrial robots is usually masked, emphasising industrial users' motion and task control. However, controlling motion and tasks in a static setting cannot be called smart or intelligent. Towards a new generation of smart robotic manufacturing, robots need to adapt to dynamic and variational environments [24]. Fig. 2 shows a scenario where humans and robots collaborate in a shared space. Interference by humans makes planned trajectories no longer feasible for the robot to follow. Hence, the robot needs to be flexible to human interaction, e.g. adjusting its motion while approaching the end point [25]. The flexibility is not only related to the motion level but also to the task level when humans change the planned procedures [26], e.g. the order of procedures or time consumption. The challenge is that interferences present uncertainties, leading to a trickier issue: the robot is required to have advanced intelligence to cover those stochastic properties. Machine learning is exactly one promising information technology to boost intelligence with robust representation.

2.2. Machine learning

Machine learning is a field that bridges computation and statistics with ties to information theory, signal processing, algorithms, control theory and optimization theory [27]. Nowadays, machine learning has become one of the most fascinating areas in artificial intelligence. The machine learning paradigm is straightforward: fitting functions using data. The trained function is then used for the approximation of new data. The development from a weak function to a stronger one utilising given data is called "learning", which is similar to the learning capability and processes of animals. Since this development is usually accomplished by machines like computers, the name "machine learning" was raised accordingly.

In machine learning, the fitted function takes the raw data or handcrafted features as the input and generates the output by passing the information through itself. Before deep learning, feature engineering was exactly the discipline for extracting features from data manually or using pre-defined rules. This process was replaced by more complex models with a stronger representation, such as deep neural networks. However, neural networks are not the only type of model used as such a function. Linear models, quadratic models, expressions of probability distributions (e.g. Gaussian estimation or Gaussian mixture models), even discrete tables are also common. For neural networks, convolutional neural networks (CNN) [28], recurrent neural networks (RNN) [29] or graph neural networks (GNN) [30] are alternatives apart from dense neural networks (DNN) in deep learning [31]. The proposal of different models came from the problems themselves. If a problem is easy to model or adequately dealt with by known probability distributions, these models can be chosen with parameters to be learned. For tricky problems, neural networks are more welcomed because of the proof that neural networks can theoretically represent any non-linear function [32]. The choice of model often depends on the dimension of the data. For instance, extracting semantic information from images needs an approximator with stronger representation capability, like deep neural networks, because images usually have a larger volume of data than common numerical inputs. As for neural networks, they also vary due to different tasks. CNN is designed for data organised with geometrical shapes such as images, while RNN suits sequential data like texts and sound. No matter which model is selected as the function of machine learning, the learning phase is to estimate the parameters in such a model based on the given data. Those elements like the model type and the inner settings (e.g. the combination of several Gaussian distributions, the number of layers in a neural network) are so-called hyper-parameters.

The input data can be a scalar, a vector, a matrix, or a tensor within the mathematical definitions. In reality, it can be a numerical array, a paragraph of text, a piece of audio or a picture. The output can also have different dimensions based on the user's requirements and is sometimes even the same as the input, e.g. in the variational auto-encoder (VAE) [33] and the generative adversarial network (GAN) [34]. As a matter of fact, machine learning uses its function to map data from input to output. Different mappings result in different applications. For instance, mapping an image to a label is recognition, mapping a sentence in one language to another is machine translation, and mapping any data to itself can be used for machine generation (e.g. painter or poet) or dimension reduction.

Finding the function parameters is exactly the approach by which a machine learns. However, parameter tuning is tightly related to the type of machine learning technology. Machine learning methods are often categorised into three sub-classes: unsupervised learning, supervised learning, and reinforcement learning. Unsupervised learning focuses on information processing without supervision signals, such as clustering. Supervised learning provides supervision signals that can be used to calculate loss functions. Parameters of the function are tuned given the optimization of the cost function. Gradient methods are the most widely used optimization approach because they can be effectively calculated by computers. Reinforcement learning has a view from another point, which is based on the "trial-and-error" paradigm. Considering human learning, supervised learning is just like learning from reading books, while reinforcement learning is like grasping knowledge from experience. It is also straightforward that not all knowledge can be learned from books. One hypothesis is "reward is enough" [35], which declares that any intelligence and its associated abilities can be understood as maximising the reward (or minimising the cost). On the other hand, reinforcement learning is exactly one crucial method for robot learning,
which is explained in the following sections.

2.3. Robot learning

In the future, robots in manufacturing will not only be settled into fixed stations executing the same procedures thousands of times but will also be involved in changing environments where tasks cannot be fully pre-programmed. This challenge requires the robot to handle uncertainties and act accordingly [36]. Robot learning is exactly the technology realising that robots learn by themselves with the help of humans, using their sensors and motors to extend and enhance their initial intelligence to cope with new environments [37].

Robot learning is the synthesis of multiple machine learning technologies in the context of robotics [38]. It stresses robot learning and acting using machine learning technology. Unlike common machine learning, robot learning emphasises generating actions as the output while observing the environment as the input. For instance, deep learning helps the robot handle unstructured environments, while reinforcement learning provides formalisms for machine behaviour. This kind of paradigm is similar to how animals behave. Fig. 3 first shows the pipeline of how humans watch and act. Humans perceive the environment by their sense organs (e.g. eyes, nose, and hands) linked to the brain through the nervous system. For instance, the ray of light in the environment is perceived by light-sensitive cells on the retina, followed by bio-electric signals to the cerebral cortex. On the other hand, the cerebral cortex also sends signals to the corresponding muscles to generate motion. This pipeline forms a closed-loop style, just like automatic control. For example, when a human is going to grasp a glass of milk, he will use his eyes to locate the glass and monitor his hand while moving his arm to reach the glass, lift it, move it to his mouth, drink and later put the glass back. Robotic devices in the closed loop share similar behaviours, just like the example in Fig. 3. The robot utilises cameras to observe the environment and uses computers to process perception data. Algorithms then send signals to the robot controller to drive the robotic manipulator. However, no animal is born with such skills but learns how to observe, make decisions and act during growth. Furthermore, this learning style is obviously not fully supervised, but more like learning from experience or the "trial-and-error" that reinforcement learning emphasises. Robot learning is inspired by how animals learn and adapt [39].

Fig. 3. Human behaviour and machine behaviour.

While general machine learning focuses on prediction and classification, robot learning stresses more the output of configuration values of robotic systems, e.g. temporal joint values or task-oriented procedures. Those signals generate motion belonging to machine behaviour [40]. This kind of technology often takes the real-time data from sensors in the physical world as the input, a.k.a. the observation, as illustrated in Fig. 4. The policy or controller is the core, mapping observations to actions indicated by robotic configurations. By doing so, one robot manipulates the physical environment and results in the state transition of the environment. Then, one new observation of the state perceived by sensors is passed to a new iteration. Similar pipelines are often defined as the Markov decision process (MDP), a simplified mathematical model for decision-making and state transition. Since decisions are made one after another, this problem is also called sequential decision-making. Concretely, unlike machine learning supervised by a direct loss function, robot learning relies on the accumulated reward (or cost) signals from the environment. Reinforcement learning plays the role of the backend mechanism when a robot learns. The one who makes decisions is often called an agent, which receives the observation and runs a policy. Observation varies from full observation to partial observation due to perception conditions. Policies are divided into on-policy methods and off-policy methods regarding the sampling style. Availability of the environment model leads to model-based and model-free robot learning. We will review and discuss the details of those terminologies in the following sections.

Fig. 4. The architecture of reinforcement learning-based robot learning.

Robot learning appears in various robotic skills. Learning for manipulation focuses on manipulating a vast array of objects through continuous learning [41]. Robots are currently skilled at picking and manipulating rigid geometry in a repetitive style but weaker at variation in movements. Purposive reasoning is another challenging issue, especially in the interaction with humans [42], which aims at reasoning about human activities for enhanced human-robot collaboration. Robot learning with data-driven approximation (e.g. various neural networks) shows excellent capability in robot grasping [43], agile motor control [44], motion planning [45], identifying by feeling [46], and so forth. Learning from imitation is also a classic but popular research topic, e.g. learning from human teachers [47], performance [48], or from other tasks that are beyond the imitation [49]. Those robotic skills also show significant potential in manufacturing to shift from mass production to mass customization [50]. In smart robotic manufacturing, robot learning has already shown its initial strength on the motor, motion, and task levels. We will review and summarise the published papers of industrial applications in the following sections.

3. Definitions and categories

In this section, we summarise indispensable preliminaries and definitions, as well as pipelines and categories. We tried our best to minimise the mathematical contents with a clear representation of methodologies sufficient for the papers reviewed.

3.1. Preliminaries and definitions

Recall what we illustrated in Fig. 4: key elements containing state, observation, policy, action, reward, and environment are involved in robot learning. Those fundamental definitions are based on robotics, automatic control, and reinforcement learning. With the perspective of robot learning, they are reviewed in detail.

Agent. The agent in robot learning is the robot or decision-maker in a robotic manufacturing system running a policy. This kind of system has the capability of perceiving and acting. An agent is a conceptual definition that is not involved in the algorithms.

State. State s (a.k.a. x in some papers) is the very actual physical existence of one robot and its shared environments, including object
surroundings, humans, or other robots. States form the state space (continuous or discrete), which is often represented by S.

Observation. Observation o is the representation of the current state perceived by sensors like cameras. The process from state to observation is called perception. Due to the conditions of perception, observation can be divided into full observation and partial observation. The latter means that sensors cannot observe the full state, which is common in the occlusion of multiple objects. Observations form the observation space, which is often represented by O. The probability of observation o given state s is the emission probability ε = p(o|s).

Action. Action a (a.k.a. u in some papers) is the movement of a robot, which is usually driven by joint angles or torques in an industrial robot. Joint angles generate motion in joint space and Cartesian space under the robot kinematics. The action can also be the selection of task procedures in manufacturing settings. Actions form the action space (continuous or discrete), which is often represented by A.

Policy. Policy π : S → A is the mapping from the state space to the action space. For partial observation conditions, it is the mapping from the observation space to the action space (π : O → A). It is a conditional probability distribution over the action space given the observation, or a maximum function on the state-action values. Formally, it can be any function, rule or even a table. Mathematically, it is represented by π(a|s) or π(a|o). θ is utilised as the parameters of this policy, whatever the shape of the function is, which makes the policy πθ. The policy can be learned directly or based on other learnable functions like the value function.

Reward. Reward r(s, a) : S × A → R is the signal fed back from the environment indicating how good an action is. It is a signal designed for reinforcement learning. The reward can be positive or negative due to the problem settings. In some research, similar signals are called cost (c in some papers) instead of reward. The accumulated reward (a.k.a. return) R is exactly the target that one agent is designed to maximise, while the accumulated cost C is the target to be minimised. Reward signals also form a reward space R.

Environment. The environment is the place where the agent acts. Mathematically, it is a transition probability on sequential states conditioned by action. It is given by p(s′|s, a), where s′ is the successive state of s. To have a consistent formulation with the other spaces, it is usually represented by T, which is also called the transition operator. The transition probability of the environment is also known as the dynamics model. The utilization of such a model leads to model-based learning and model-free learning.

Trajectory. Trajectory τ is the record of the state, action, and reward from the perspective of an agent from the initial state to the terminal state, as in (3.1). One trajectory is also known as one episode. There is no terminal state in the infinite Markov decision process, where one episode needs to be stopped manually. An infinite MDP also requires the process to be stationary; in this condition, all states can be reached and revisited. The probability of one trajectory is given in (3.2). It is determined by the initial state, the environment with the Markovian property and the policy. If the policy is parameterised by θ, it can be represented by πθ(τ). This terminology is not the same one as in robotics, so we use motion trajectory to indicate the trajectories of the TCP on an industrial robot.

\tau = s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_{T-1}, a_{T-1}, r_{T-1}, s_T \quad (3.1)

\pi_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \quad (3.2)

Markov Decision Process. The Markov decision process is a stochastic process for sequential decision-making problems defined by a tuple ⟨S, A, T, R⟩, or ⟨S, A, O, T, R, ε⟩ for the partially observable Markov decision process (POMDP). In an MDP, the Markovian property of states is the primal hypothesis. For one agent, the optimal policy is expressed in (3.3). Robot learning with reinforcement learning is exactly pursuing the expectation of the maximum accumulated rewards by tuning the policy. Sometimes, a discount factor γ is multiplied on the temporal reward signal for better convergence. We omit the MDP using cost signals since it is a negative version of reward-based robot learning. Fig. 5 illustrates the relationship amongst the different elements and spaces at a glance.

\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \Big[ \sum_t r(s_t, a_t) \Big] \quad (3.3)

Fig. 5. The structure of the Markov decision process.
usually represented by T , which is also called the transition operator.
the performance of a policy mapping observation to action to get more
The transition probability of the environment is also known as the dy­
rewards. Repetition of these elements constitutes the whole iteration of
namics model. The utilization of such a model leads to model-based
robot learning.
learning and model-free learning.
Robot learning methods can be categorised with different perspec­
Trajectory. Trajectory τ is the record of the state, action, and reward
tives of robot learning, as summarised in Table 2. Firstly, it can be cat­
in the perspective of an agent from the initial state to the terminal state
egorised into imitation learning and reinforcement learning according to
in (3.1). One trajectory is also known as one episode. There is no ter­
the learning fashion. Imitation learning or learning from demonstrations
minal state in the infinite Markov decision process where one episode
lies in supervised learning, while reinforcement learning with the trial-
needs to be stopped manually. Infinite MDP also requires the process to
and-error approach is utilised in the occasions where expert data are not
be stationary; in this condition, all states can be reached and revisited.
available. Based on the utilisation of the environment model, robot
The probability of one trajectory is given in (3.2). It is indicated by the
learning can be divided into model-based learning and model-free
initial state, the environment with the Markovian property and the
learning. After the approximation of the environment model, it can be
policy. If the policy is parameterised by θ, it can be represented by πθ (τ).
This terminology is not the same one in robotics, so we use motion
trajectory to indicate the trajectories of the TCP on an industrial robot.
τ = s1 , a1 , r1 , s2 , a2 , r3 , ⋯, sT− 1 , aT− 1 , rT− 1 , sT (3.1)


T
π (τ) = p(s1 ) π(at |st )p(st+1 |st , at ) (3.2)
t=1

Markov Decision Process. Markov decision process is a stochastic


process for sequential decision-making problems defined by a tuple 〈S ,
A ,T ,R 〉, otherwise 〈S , A , O , T , R , ε〉 for partially observable Markov
decision process (POMDP). In MDP, the Markovian property of states is
the primal hypothesis. For one agent, the optimal policy is expressed in
(3.3). Robot learning with reinforcement learning is exactly pursuing
the expectation of the maximum accumulated rewards by tuning the
Fig. 5. The structure of the Markov decision process.

6
Z. Liu et al. Robotics and Computer-Integrated Manufacturing 77 (2022) 102360

utilised for policy improvement, which is the paradigm that optimal control follows [51]. Model-free learning is used under the condition that the environment model is hard to obtain, for instance, an environment with uncertainties [52]. Model-based and model-free learning can also be tied together, for example, in the Dyna architecture [53]. The difference in sampling method results in on-policy and off-policy learning methods. It depends on whether the policy to be improved is utilising samples generated from other policies or from itself. Robot learning can also be categorised by the type of algorithms in reinforcement learning. Details of those methods are reviewed in the following section.

Table 2. Category of robot learning methods.
Perspectives | Category | Conditions | Related elements
Type of learning | Imitation learning or learning from demonstrations (LfD) | Expert data as the supervision signal. | A+B+C
Type of learning | Reinforcement learning | No expert data. | A+B+C
Model of the environment | Model-free learning | Model of the environment is unknown. | B
Model of the environment | Model-based learning | Model of the environment is known. | B
Sampling approach | On-policy learning | Policy for sampling and improvement is the same one. | A
Sampling approach | Off-policy learning | Policy for sampling and improvement is not the same one. | A
Type of algorithms in reinforcement learning | Value function learning | State value function and state-action value function take the core role. | B+C
Type of algorithms in reinforcement learning | Policy gradient learning | Directly differentiates the objective. | B+C
Type of algorithms in reinforcement learning | Actor-critic learning | Combination of the value function and direct policy learning. | B+C

4. Methodologies and principles

In this section, we review the methodologies and principles of robot learning, including imitation learning, value function learning, policy gradient learning, actor-critic learning, and model-based learning. Concretely, Section 4.1 stresses robot learning with supervised learning. Sections 4.2 to 4.4 belong to the domain of reinforcement learning. Section 4.5 emphasises utilising the environment model for robot learning. Comparisons of different methods and the relationship between decision-making and control are also given. Most of the papers reviewed in this section have been publicly cited more than one hundred, one thousand and even ten thousand times.

4.1. Imitation learning

Imitation learning is a decision-making method in robot learning using supervised learning. It is also called learning from demonstration, learning from a human teacher or behaviour cloning. The policy in imitation learning is trained with expert data as the supervision signals. Cost functions are derived directly from the distance between the policy and the expert data under the same states. Imitation learning also relies on the Markov hypothesis on the states. Fig. 8 shows the structure of imitation learning. Supervision signals are labelled manually and then integrated into the dataset. Then supervised learning algorithms are launched based on the dataset to improve the policy. The policy can be a neural network ending with a softmax layer over actions for a discrete action space, while for a continuous action space, the policy can be a Gaussian mixture model. This structure follows the logic in Fig. 7 but varies on all the elements (A, B and C) in it (which is also summarised in Table 2) from reinforcement learning. The policy can be utilised to generate new data for labelling again, which is called the dataset aggregation (DAgger) method [54].

Fig. 8. The structure of imitation learning.
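A minimal sketch of the two ingredients just described is given below, assuming PyTorch: a behaviour-cloning step that regresses the policy onto expert state-action pairs, and a DAgger-style loop in which the current policy visits states that an assumed expert then labels, with the aggregated dataset used for retraining. The policy, expert and env objects are hypothetical placeholders; this illustrates the learning fashion rather than the implementation of any cited work.

```python
import torch
import torch.nn as nn

def behaviour_cloning_step(policy, optimiser, states, expert_actions):
    """One supervised update: minimise the distance between pi(s) and the expert action."""
    loss = nn.functional.mse_loss(policy(states), expert_actions)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

def dagger(policy, optimiser, env, expert, iterations=10, horizon=100, epochs=5):
    """DAgger-style aggregation: run the current policy, let the expert relabel, retrain."""
    dataset_states, dataset_actions = [], []
    for _ in range(iterations):
        state, _ = env.reset()
        for _ in range(horizon):                          # collect states visited by the learner
            state_t = torch.as_tensor(state, dtype=torch.float32)
            dataset_states.append(state_t)
            dataset_actions.append(expert(state_t).detach())   # expert supplies the label
            action = policy(state_t).detach().numpy()
            state, _, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                break
        states = torch.stack(dataset_states)              # aggregated dataset so far
        actions = torch.stack(dataset_actions)
        for _ in range(epochs):
            behaviour_cloning_step(policy, optimiser, states, actions)
    return policy
```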
The advantage of imitation learning is its ease of use. For example, NVIDIA utilised a CNN to train the policy of an AGV on outdoor roads [55]. Domain knowledge like human driving skills helped the policy improvement effectively. Giusti et al. [56] explored imitation learning on a UAV in forest trails. However, imitation learning often fails when the manually labelled data are not enough, when a new observation occurs, or when humans make mistakes in the labelling.

On the other hand, manual labelling is expensive, especially when the dimension of the action is enormous. Accumulated errors deteriorate the problem because they lead the agent to reach observations that are rarely in the dataset. The data drifting problem can be improved by adding new data into the dataset, like the DAgger method with the on-policy
style in Fig. 8, while a function with a sequential structure can relax the compromise on the Markov hypothesis of the human supervision, like LSTM on robotic manipulation policies [57,58]. Another shortcoming is that humans are good at macroscopic orders like "turn right" or "move forward" but weaker at continuous numerical control like the joint angles on the motor and motion levels of a robotic device. To solve those problems, robots are expected to learn by themselves continuously with unlimited data. Reinforcement learning is such a technology, which we review in the field of robot learning below.

4.2. Policy gradient learning

As it targets the accumulated reward, reinforcement learning can be solved from an optimization perspective. Policy gradient is exactly a method of optimising the policy directly subject to the accumulated reward in a model-free style. For a policy governed by parameters θ, this optimization problem can be formulated as in (3.4). In shorthand, the target of the optimization problem is given in (3.5). Clearly, J(θ) is conditioned on S × A × t, which makes it a huge space to be searched. Hereby, policy optimization has been transformed from (3.3) to (3.4), where the policy can be represented by deep neural networks. Policy gradient solves this challenge by approximating the argument-wise maximum function in (3.3).

\theta^* = \arg\max_{\theta} \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \Big[ \sum_t r(s_t, a_t) \Big] \quad (3.4)

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \Big[ \sum_t r(s_t, a_t) \Big] \approx \frac{1}{N} \sum_i \sum_t r(s_{i,t}, a_{i,t}) \quad (3.5)

Gradient ascent can be deployed on J(θ) as in (3.6). r(τ) is used to indicate the accumulated reward in one episode (a.k.a. trajectory), which can be used in both discrete and continuous conditions. This procedure can also be enhanced with common machine learning skills like momentum or the ADAM method. Based on (3.2), πθ(τ) in (3.6) is only related to πθ(at|st), which is exactly the approximator to be optimised. Hence, the gradient of the objective function can be derived as (3.7). It indicates an unbiased estimation given samples generated by running the policy. After sampling multiple episodes, the policy can be updated with (3.8), where α is the learning rate. This paradigm forms what is called the REINFORCE algorithm [59]. The structure of the policy gradient method is summarised in Fig. 9. It is also called the Monte Carlo policy gradient (MCPG) [60] method, which refers to taking samples and then summarising the rewards iteratively. Compared with imitation learning, which is indeed a maximum likelihood problem, policy gradient utilises the reward signal as the weight on the gradient of the logarithmic function over πθ(at|st). Overall, the REINFORCE optimization encourages the agent to perform more good actions while preventing it from bad actions by deploying the gradient ascent.

\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau = \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \big[ \nabla_\theta \log \pi_\theta(\tau)\, r(\tau) \big] \quad (3.6)

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \Big( \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \Big) \Big( \sum_t r(s_{i,t}, a_{i,t}) \Big) \quad (3.7)

\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) \quad (3.8)

Fig. 9. The structure of the policy gradient method.
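To ground (3.6)-(3.8), the sketch below implements the plain REINFORCE update: sample several episodes with the current stochastic policy, weight the log-probability gradients by the total episode reward r(τ), and take one gradient step on θ. The Gaussian policy head, PyTorch and the Gym-style environment interface are assumptions made for illustration; variance reduction via the baseline of (3.9) is deliberately omitted here.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a|s): a diagonal Gaussian whose mean is produced by a small network."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def distribution(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

def reinforce_update(policy, optimiser, env, episodes=10, horizon=200):
    """One REINFORCE iteration: theta <- theta + alpha * grad J(theta), cf. (3.7)-(3.8)."""
    loss = 0.0
    for _ in range(episodes):
        log_probs, rewards = [], []
        obs, _ = env.reset()
        for _ in range(horizon):
            dist = policy.distribution(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            obs, reward, terminated, truncated, _ = env.step(action.numpy())
            log_probs.append(dist.log_prob(action).sum())
            rewards.append(reward)
            if terminated or truncated:
                break
        episode_return = sum(rewards)                        # r(tau): total episode reward
        loss = loss - torch.stack(log_probs).sum() * episode_return
    (loss / episodes).backward()                             # negative sign: ascent via a descent optimiser
    optimiser.step()
    optimiser.zero_grad()
    return loss.item()
```

The optimiser is assumed to be built over policy.parameters(); because the optimiser minimises, the objective is negated so that the step performs the gradient ascent of (3.8).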
After the proposal of the naive REINFORCE algorithm, the policy gradient method has been enriched with various improvements for an easier and more stable learning process. (3.9) indicates how to utilise causality and a baseline b to reduce the variance. Concretely, b can be the average reward in a batch of episodes.

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \Big( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) - b \Big) \quad (3.9)

On the other hand, policy gradient is an on-policy method whose old samples must be discarded after a new sampling iteration. However, deep neural networks can only change slightly within one update step. To deal with this inefficiency, a fast simulator or an off-policy scheme can be adopted for policy gradient-driven deep reinforcement learning. Importance sampling is a frequently used method for these cases [61]. It is a technique that makes it possible for a new policy θ′ to be updated using samples (tuples of states, actions and rewards) from an old policy θ. (3.10) shows a gradient calculation using causality for importance sampling.

\nabla_{\theta'} J(\theta') \approx \frac{1}{N} \sum_i \sum_t \nabla_{\theta'} \log \pi_{\theta'}(a_{i,t} \mid s_{i,t}) \Bigg( \prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{i,t'} \mid s_{i,t'})}{\pi_\theta(a_{i,t'} \mid s_{i,t'})} \Bigg) \Big( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) - b \Big) \quad (3.10)

Recently, advanced policy gradient methods have been enhanced in combination with research on value function learning and actor-critic learning, which are introduced in the following sections. Efficiency and stability are the two main issues studied at this stage. In policy gradient with deep neural networks as the policy, different parameters in the network change the policy probabilities by different amounts. Hence, the policy to be updated and the old policy used for sampling need to be bounded. To address this problem, the natural policy gradient (NPG) [62] was firstly recalled to stabilise policy gradient training by bounding the two policies with the Kullback-Leibler (KL) divergence. Furthermore, trust region policy optimization (TRPO) [63] was proposed with tuneable bound parameters on the learning rate in NPG. By doing so, policy gradient with direct optimization of the cumulated reward has been transformed into a constrained optimization problem where a Lagrangian function can be constructed. Then, dual gradient descent can be instantiated to optimise it. If importance sampling is used directly on the objective, regularisation is conducted to enforce the two policies to stay close. Proximal policy optimization (PPO) was proposed later based on this idea [64].
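As a hedged illustration of how the importance weight in (3.10) is kept bounded in practice, the function below computes the clipped surrogate loss popularised by PPO: the probability ratio between the updated and the sampling policy is clamped so that a single update cannot move the policy far from the one that generated the samples. Tensor shapes and the advantage estimates are assumed to be provided by the caller, and this is a simplified form rather than the exact objective of [64].

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective: returns a loss to minimise (negative of the surrogate).

    new_log_probs: log pi_theta'(a_t|s_t) under the policy being updated
    old_log_probs: log pi_theta(a_t|s_t) under the policy that collected the samples
    advantages:    estimated advantages for the same state-action pairs
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())        # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                      # pessimistic (clipped) bound
```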

4.3. Value function learning

Unlike policy gradient, value function learning utilises a value function to evaluate how good a state is with the agent following a policy.
Correspondingly, a robot agent can then update its policy by searching the action space for the maximum of the estimated value function. Deep reinforcement learning using value function learning instantiates deep neural networks for the value function rather than the policy. According to the input, those networks vary from CNNs [52] to the latent codes in a neural generator [65]. Overall, it eventually stresses the maximum cumulated reward, which is the universal target of reinforcement learning.

The state-action value function (3.11) and the state value function (3.12) are the backbones of such a method. The state value function (a.k.a. V function) represents how good a state is when a robot follows a policy, while the state-action value function (a.k.a. Q function) values how good an action is when the robot is at a deterministic state following a policy. Those value functions can be estimated by the Monte Carlo trial-and-error paradigm running prior policies of the robot agent. For discrete and small-scale state and action spaces, like training a mobile robot escaping from a 2D maze, the value function can be represented by tabulations. These tasks were also the place where value function learning began.

Q^\pi(s_t, a_t) = \mathbb{E}_{\tau \sim \pi(\tau)} \Big[ \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \Big] \quad (3.11)

V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \big[ Q^\pi(s_t, a_t) \big] \quad (3.12)

At the beginning of value function research, policy iteration or value iteration using dynamic programming were the typical solutions for discrete settings like 2D mazes [66]. Eq. (3.13) shows the bootstrapped update of the V function in policy iteration. Eqs. (3.14) and (3.15) show the similar processes of the V and Q functions in value iteration. For more complex tasks with broader state spaces like images and video, function estimators were later employed for the value functions, resulting in fitted value iteration. The fitting phase of those function estimators follows the supervised regression style, taking the distance in (3.16) as the supervision signal. For the policy improvement in value function learning, the policy can be updated by indexing actions in the Q function to find the maximum Q value (3.17). This binary policy makes the naive value function learning come with a deterministic policy.

V^\pi(s) \leftarrow r(s, \pi(s)) + \mathbb{E}_{s' \sim p(s' \mid s, \pi(s))} \big[ V^\pi(s') \big] \quad (3.13)

Q^\pi(s, a) \leftarrow r(s, a) + \mathbb{E}_{s' \sim p(s' \mid s, a)} \big[ V^\pi(s') \big] \quad (3.14)

V^\pi(s) \leftarrow \max_a Q^\pi(s, a) \quad (3.15)

\mathcal{L}(\phi) = \frac{1}{2} \sum_i \big\| V_\phi(s_i) - \max_a Q^\pi(s_i, a) \big\|^2 \quad (3.16)

\pi(a_t \mid s_t) = \begin{cases} 1 & \text{if } a_t = \arg\max_{a_t} Q^\pi(s_t, a_t) \\ 0 & \text{otherwise} \end{cases} \quad (3.17)
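For the small discrete settings mentioned above, (3.14), (3.15) and (3.17) can be run directly on arrays. The sketch below performs tabular value iteration on a finite MDP described by a transition tensor and a reward matrix; the array layout, the stopping tolerance and the discount factor (formally introduced only in Section 4.4, added here so the iteration converges) are assumptions for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration on a finite MDP.

    P: transition probabilities, shape (S, A, S), P[s, a, s'] = p(s'|s, a)
    R: immediate rewards, shape (S, A)
    Returns the state values V and a greedy deterministic policy, cf. (3.15) and (3.17).
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V             # Q[s, a] = r(s, a) + gamma * E_{s'}[V(s')], cf. (3.14)
        V_new = Q.max(axis=1)             # V(s) <- max_a Q(s, a), cf. (3.15)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)        # greedy policy: argmax_a Q(s, a)
```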
However, policy iteration, value iteration and fitted value iteration assume that all the transition dynamics are known when calculating the state value. This is because the agent needs to arrive at the same state during the Monte Carlo evaluation to get the maximum Q value amongst all the actions in Eq. (3.15). It further deteriorates when the task is in a stochastic environment. Q-learning [67] was proposed to get rid of this issue by estimating the Q function, which takes the state and action together, instead of only the V function. Once the Q function is estimated, the maximization can be conducted on it with different actions as the input.

Fig. 10. The structure of the Q-learning method.

The structure of Q-learning is illustrated in Fig. 10. It can be instantiated with fitted Q-iteration or online Q-learning [68]. Fitted Q-iteration utilises (3.18) to iterate the Q values, which also plays the role of supervision in (3.19). Then, the Q function with parameters ϕ is updated with (3.19). Fitted Q-iteration also follows the off-policy style, which means the samples fitting the Q function can be generated from different policies. Fitted Q-iteration varies from Q-learning due to the iteration mechanism of the crucial procedures. In Fig. 10, N represents the number of sampling iterations, while K shows the cycle of Q function iteration and estimation. For fitted Q-iteration learning, K is set in the inner loop with a higher value than N. If N = K = 1, fitted Q-iteration becomes an online learning fashion. However, online learning does not suit neural networks well, and the reason is explained in actor-critic learning. Another problem of Q-learning is the "exploration and exploitation" trade-off. Epsilon greedy or Boltzmann exploration are tools often used.

y_i \leftarrow r(s_i, a_i) + \max_{a'_i} Q_\phi(s'_i, a'_i) \quad (3.18)

\phi \leftarrow \arg\min_{\phi} \frac{1}{2} \sum_i \big\| y_i - Q_\phi(s_i, a_i) \big\|^2 \quad (3.19)
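The sketch below writes out (3.18)-(3.19) for a batch of off-policy transitions: bootstrapped targets are built from a (target) Q-network and treated as fixed labels, and the Q-network is regressed onto them. The network, optimiser and batch objects are assumed placeholders, and a discount factor and terminal mask, which are standard in practice though not shown in (3.18), are included.

```python
import torch
import torch.nn as nn

def fitted_q_update(q_net, target_q_net, optimiser, batch, gamma=0.99):
    """One fitted Q-iteration step on a batch of off-policy transitions.

    batch: tensors (states, actions, rewards, next_states, dones)
    q_net(s) is assumed to return Q-values for every discrete action, shape (B, |A|).
    """
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():                                        # targets are treated as fixed labels
        max_next_q = target_q_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * max_next_q   # y_i in (3.18)
    q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    loss = 0.5 * nn.functional.mse_loss(q_values, targets)       # squared Bellman error, (3.19)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```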
What needs to be pointed out is that what is optimised by fitted Q-iteration and Q-learning is the squared error defined in (3.19), which is also called the Bellman error. It reflects that value function learning is not a direct gradient method. If the value function is represented by a table, value function learning can be proved to converge. However, if it is fitted by neural networks, this learning method loses the guarantee of convergence. There is a trade-off between convergence and representation capability. No mathematical guarantee of convergence and correlated sequential samples are the main problems of Q-learning. To cope with the time correlation problem, the replay buffer was designed to be a wider container for samples from the beginning and the end. However, the convergence issue cannot be solved but only mitigated for value function learning that uses neural networks. One skill is using a target network as a temporal copy of the fitted value function, decoupling the training target and the neural network calculating the Q values. What is constructed by the replay buffer and the target network is the so-called deep Q-learning network (DQN) [52].

Additionally, techniques like Polyak averaging and the priority replay buffer [69] also benefit the training of value function learning methods with deep neural networks. Similar skills like the double deep Q network (DDQN) [70] and multi-step returns (MSR) [71] were proposed to solve the overestimation problem. At the same time, importance sampling can also be employed to transform on-policy MSR into off-policy MSR. The duelling network architecture was proposed to relieve the policy evaluation with similar-valued actions [72]. Nevertheless, DQN has shown an excellent capability of artificial intelligence in challenging tasks like video games [52] and the Go competition [73], though it faced the issues reviewed above.

On the other hand, continuous actions cannot be realised by any value function learning using tables or neural networks with the policy defined in (3.17). For a continuous action space, calculations like argmax or max cannot be conducted directly because the search space is infinite. To make value function learning available for continuous action spaces, one method is to use a function class that is easier to optimise, for instance, the normalised advantage function (NAF) [74]. The other feasible way is to wrap or replace the calculation related to the maximum, just like in policy gradient learning. Techniques like gradient-based optimization (e.g. stochastic gradient descent) or stochastic optimization can be deployed upon the maximum calculation
here when the dimension of the action space is relatively low. Meanwhile, a function approximator can be set with machine learning to take the place of the raw maximization. At this stage, the value function met policy gradient learning, and the deep deterministic policy gradient (DDPG) method was proposed [75]. DDPG is exactly a policy gradient learning method that utilises the replay buffer and target network of DQN to enhance stability.

Value function-based robot learning plays an increasingly important role in robotic applications. For instance, Sascha et al. [65] utilised value function learning for AGV control, which takes the image of the trace of the AGV as the input. Gu et al. [76] investigated the 3D manipulation challenge on real physical robots with asynchronous learning amongst multiple industrial robots. It provided value function learning that works well on physical industrial robots and is also feasible for robots to share their knowledge during robot learning. Kalashnikov et al. [77] proposed a framework called QT-Opt, which is a scalable learning approach using Q-learning. It finally realised closed-loop vision-based robotic grasping with 96% success on unseen objects, and the case studies were given on KUKA iiwa industrial robots.

4.4. Actor-critic learning

Actor-critic (A-C) learning is derived from the policy gradient method with inspiration from value function learning. It utilises a critic to pursue an enhanced estimation of the accumulated reward \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) in the last term of (3.9), a.k.a. the "reward to go" [78]. The critic can be exactly the Q function, the state-value function or an advantage function as defined in (3.20), which estimates how much better an action is when the agent is at the state s_t following a policy π. Hence, it is also called policy evaluation in some papers. For instance, deep neural networks can be used to fit the state-value function. In this case, the deep neural network with parameter ϕ takes the state as the input and generates the state value of it. The dataset {s_{i,t}, y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})} for this training is exactly the samples from running the prior policy, and the loss function of the state-value function estimation follows a supervised regression style in (3.21). Unlike policy gradient ascent, gradient descent should be applied to such a loss function. Sometimes, a bootstrapped estimation can also be utilised, whose dataset is established with {s_{i,t}, y_{i,t} = r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})}.

A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t) \quad (3.20)

\mathcal{L}(\phi) = \frac{1}{2} \sum_i \big\| \hat{V}^\pi_\phi(s_i) - y_i \big\|^2 \quad (3.21)

Combined with policy evaluation using state-value function estimation, the complete architecture of batch actor-critic learning can be summarised as in Fig. 11. Since the fitted state-value function can take any state as the input, the advantage defined in (3.20) can be isolated from the accumulated reward (3.22). It leads to the loss function in actor-critic learning shown in (3.23). This results in a lower variance of the policy, which helps the training phase with a more stable learning curve. In this way, actor-critic learning is also named advantage actor-critic learning (A2C). However, due to the trade-off between bias and variance, using the estimated state-value function will lead to estimation with bias. Recently, several attempts were proposed to make the A-C method have lower variance but no bias, like Q-prop [79] and generalised advantage estimation (GAE) [80].

Fig. 11. The structure of the batch actor-critic learning method.

\hat{A}^\pi(s_t, a_t) = r(s_t, a_t) + \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t) \quad (3.22)

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{A}^\pi(s_{i,t}, a_{i,t}) \quad (3.23)

Apart from what has been introduced above, a discount factor γ ∈ [0, 1] is often used in long-horizon or cyclical tasks [81]. It is a factor deployed on the future reward. It is also set to guarantee that the cumulated reward will not diverge and that the agent will focus more on the reward in the nearer future. In this case, the supervision signal for the state-value function estimation will be constrained by the discount factor in (3.24). Accordingly, the advantage in (3.22) and the gradient in (3.23) need to be enriched with a discount factor in (3.25) and (3.26). Since the reward within neighbouring timestamps has a lower variance than the reward of timestamps in the far future, it is designed exponential with timestamp t in policy gradient. Here, since the advantage function only calculates the difference of the Q function and the state-value function at a single timestamp, the exponent on γ equals 1.

y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1}) \quad (3.24)

\hat{A}^\pi(s_t, a_t) = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t) \quad (3.25)

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \Big( r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1}) - \hat{V}^\pi_\phi(s_{i,t}) \Big) \quad (3.26)

With the discount factor embodied in the state-value function estimation and gradient ascent on the policy network, actor-critic can be used in an on-policy fashion. The online actor-critic learning method is illustrated in Fig. 12, which is a method without a dataset. It can be applied in an online learning style, and every step of the agent will update its policy. On the other hand, also due to the absence of rich datasets, function estimators with neural networks cannot be utilised directly in the critic function, which results in a weaker representation capability. Recently, synchronous parallel actor-critic learning and asynchronous parallel actor-critic learning (A3C) [82] have been proposed to address this problem, with simulators or robots running simultaneously, as shown in Fig. 13.

Fig. 12. The structure of the online actor-critic learning method.

Overall, actor-critic learning absorbs value function learning and policy gradient learning more comprehensively and precisely. Later, actor-critic learning was further enhanced in stability and performance, such as the twin delayed deep deterministic policy gradient (TD3) [83], soft Q-learning (SQL) [84] and soft actor-critic (SAC) [85].
summarised in Fig. 11. Since the fitted state-value function can take any
state as the input, the advantage defined in (3.20) can be isolated from
the accumulated reward (3.22). It leads to the loss function in actor- 4.5. Model-based learning
critic learning shown in (3.23). This results in a lower variance of the
policy, which helps the training phase with a more stable learning curve. Imitation learning, reinforcement learning, including policy gradient
In this way, actor-critic learning is also named advantage actor-critic learning, value function learning and actor-critic learning in former sub-
learning (A2C). However, due to the trade-off between bias and vari­ sections are deployed model-freely. It means that the state transition
ance, using the estimated state-value function will lead to estimation probability p(s |s, a) of the environment is not requisite in these methods.

Fig. 11. The structure of the batch actor-critic learning method. Fig. 12. The structure of the online actor-critic learning method.
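Where a concrete form helps, the batch actor-critic update above can be sketched in a few lines. The snippet below is a minimal illustrative example rather than the implementation of any cited work: the two small networks, their sizes, the learning rates and the discrete-action assumption are all placeholder assumptions, while the critic loss follows the supervised regression in (3.21) with the bootstrapped target (3.24), and the actor loss follows the advantage-weighted gradient in (3.25)-(3.26).

# Minimal advantage actor-critic (A2C) update sketch in PyTorch.
# Networks, sizes and hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 4, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))  # logits of pi_theta
value_fn = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))      # V_phi
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
v_opt = torch.optim.Adam(value_fn.parameters(), lr=1e-3)

def a2c_update(s, a, r, s_next, done):
    """One batch update; s, s_next: [N, obs_dim] float tensors, a: [N] long, r, done: [N] float."""
    # Critic: bootstrapped target as in (3.24), then the regression loss of (3.21).
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * value_fn(s_next).squeeze(-1)
    v_loss = 0.5 * ((value_fn(s).squeeze(-1) - y) ** 2).mean()
    v_opt.zero_grad()
    v_loss.backward()
    v_opt.step()

    # Actor: the advantage (3.25) is detached, so only the policy receives this gradient, cf. (3.26).
    with torch.no_grad():
        adv = r + gamma * (1.0 - done) * value_fn(s_next).squeeze(-1) - value_fn(s).squeeze(-1)
    logp = torch.distributions.Categorical(logits=policy(s)).log_prob(a)
    pi_loss = -(logp * adv).mean()   # minimising the negative performs gradient ascent on (3.26)
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()
    return v_loss.item(), pi_loss.item()

In an on-policy setting, such an update would be called once per freshly collected batch; A3C-style variants instead let several parallel workers compute these gradients and merge them.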


4.5. Model-based learning

Imitation learning and reinforcement learning, including the policy gradient learning, value function learning and actor-critic learning in the former sub-sections, are deployed model-freely. It means that the state transition probability p(s'|s, a) of the environment is not requisite in these methods. Monte-Carlo sampling solves the calculation of the mathematical expectations. In other words, those methods can be solved by data rather than by the model. This category of learning methods shows significant learning capability when the environment model is hard to establish, such as complex physical processes or cases where the modelling of uncertainties is costly. The idea of model-free engineering is also well-known in robotic and automatic control, PID controllers, for instance.

However, if the environment model can be utilised in robot learning, it may boost the learning curve of the model-free learning method, or the decision-making problem can even be directly solved by utilising the model. This is so-called model-based learning. In robotics, models are often in kinematics and dynamics where the model is tight and clear. Sometimes, a model is not available but can be estimated, such as the parameter identification of industrial robots [86]. For any sequential decision-making problem with a feasible environment model, the model f(s, a) (the numerical-output style of p(s'|s, a)) can generate all the states in a trajectory given the initial state. By maximising the reward signal at every timestamp, this decision-making problem can be solved by optimising the actions along the trajectory. In this way, the decision-making problem can also be regarded as optimal planning, trajectory optimization or policy search, since all the possible states can be conducted by the environment model. The objective of this optimization problem can be summarised in (3.27).

a_1, a_2, \cdots, a_T = \arg\max_{a_1, a_2, \cdots, a_T} E\left[ \sum_t r(s_t, a_t) \mid a_1, a_2, \cdots, a_T \right]    (3.27)

Stochastic optimization, the random shooting method, and the cross-entropy method (CEM) [87] are common solutions proposed earlier to handle this kind of optimization problem. Additionally, for discrete action spaces with higher dimensions, the Monte-Carlo tree search method (MCTS) has shown a significant capability to deal with a model with an enormous search space, like the Go game [73,88,89]. On the other hand, derivatives can be utilised on a continuous model, which means that gradients can be backpropagated from the objective through the model. From this perspective, the optimal planning can be solved by the shooting method with the view of unconstrained optimization (3.28), or by the collocation method with the view of constrained optimization (3.29).

\max_{a_1, a_2, \cdots, a_T} r(s_1, a_1) + r(f(s_1, a_1), a_2) + \cdots + r(f(f(\cdots)\cdots), a_T)    (3.28)

\max_{a_1, a_2, \cdots, a_T} \sum_t r(s_t, a_t) \quad \text{s.t.} \quad s_t = f(s_{t-1}, a_{t-1})    (3.29)

From the perspective of a linear or nonlinear model, the linear-quadratic regulator (LQR) can be used to solve this optimal planning problem with linear models. It is stable and robust in the deterministic condition where s_t = f(s_{t-1}, a_{t-1}). Differential dynamic programming (DDP), the iterative linear quadratic regulator (iLQR) and the iterative linear quadratic Gaussian (iLQG) methods were investigated to be employed on nonlinear models by approximating a nonlinear system as a linear-quadratic system. A similar idea also occurred in the field of automatic control, under the names of optimal control, online trajectory optimization, or model predictive control (MPC) [90,91].

Nevertheless, MPC methods will be helpless when the exact model is absent. Model-based reinforcement learning is the technology that solves a sequential decision-making problem by fitting the model of the environment and then planning through it. The objective of model fitting can be minimising the distance between the trained model and new samples, like \sum_i \| f(s_i, a_i) - s'_i \|^2. MPC and iLQR can be added onto the learned model to re-plan and act. For the type of the model to be trained, Gaussian processes (GPs), the Gaussian mixture model (GMM), a specific physics model (e.g. parameter identification of a robot manipulator), or deep neural networks can be utilised. A GPs-based learning method has been tested on robotic manipulation at an early stage [92]. In addition to learning the global model, local model methods were later studied to relieve the training of the global model, especially when it is represented by enormous deep neural networks. Local models stress the local linearization of a global model, making it easier to compute and also viable to act in the real world. Similar ideas of trajectory bounding in TRPO were also employed here, transmitting the problem into constrained optimization. Dual gradients on the dual function derived from the Lagrangian function solved this optimization together with iLQR. This style of local model-based learning realised contact-rich manipulation on a PR2 robot with an assembly task. Since the optimization is constrained by a trust region, it is also known as guided policy search (GPS) [93]. Fig. 14 shows different fashions of model-based learning at a glance.

Taking the observation space into consideration, model-based learning can also be initiated with image signals like robot vision, just as what can be realised in model-free learning methods. For observations from cameras, the dimension is usually too large to be processed at a higher rate, such as in a real-time system. Hence, learning in latent space using deep auto-encoders became viable, such as embedded control [94] and visuomotor control of a robotic manipulator using CNN and spatial softmax [95].

Apart from learning a model and planning through it, another fashion is to learn a policy based on a model. If the policy can be learned directly, the re-planning in iLQR can be omitted, making it faster and better at generalization. It is also contained in Fig. 14. To learn the policy using the environment model, the model needs to be differentiable. It also makes the decision-making problem a constrained form, but with one more constraint on the policy (3.30). One feasible solution is constructing the augmented Lagrangian function of the primal objective and then optimising it by dual gradient and the alternating direction method of multipliers (ADMM) [96]. Another way is to take the idea from imitation learning but use the model to generate supervision signals rather than human experts, such as the PLATO algorithm on a flying control challenge of a UAV in the forest [97].

\max_{a_1, a_2, \cdots, a_T} \sum_t r(s_t, a_t) \quad \text{s.t.} \quad s_t = f(s_{t-1}, a_{t-1}), \; a_t = \pi_{\theta}(s_t)    (3.30)

Lastly, the environment model can also be utilised in model-free learning, provided the model is disassembled from the backpropagation and planning. In this way, the model is employed as a simulator [98]. This idea originated from the Dyna architecture, which is also used for model-based acceleration [74].

Fig. 14. The structure of the model-based learning method.
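As an illustration of the planning objective in (3.27)-(3.28), the sketch below shows a random-shooting planner used in an MPC fashion on top of a learned model f(s, a). It is a minimal example under stated assumptions: dynamics_model and reward_fn are placeholders for a fitted batch model and a task reward, and the horizon, bounds and sample count are arbitrary; CEM [87] would additionally refit the sampling distribution around the elite action sequences at every iteration.

# Random-shooting planning over a learned dynamics model, used MPC-style.
# dynamics_model(states, actions) and reward_fn(states, actions) are assumed to be
# vectorised callables returning next states and rewards for a batch of candidates.
import numpy as np

def random_shooting_plan(s0, dynamics_model, reward_fn, act_dim,
                         horizon=15, n_candidates=1000, act_low=-1.0, act_high=1.0):
    """Sample action sequences, roll them out through the model, return the best first action."""
    plans = np.random.uniform(act_low, act_high, size=(n_candidates, horizon, act_dim))
    returns = np.zeros(n_candidates)
    states = np.repeat(s0[None, :], n_candidates, axis=0)
    for t in range(horizon):
        actions = plans[:, t, :]
        returns += reward_fn(states, actions)      # accumulate the predicted reward r(s_t, a_t)
        states = dynamics_model(states, actions)   # s_{t+1} = f(s_t, a_t) under the learned model
    best = np.argmax(returns)
    return plans[best, 0, :]                       # execute only the first action, then re-plan

# Usage sketch: at every control step, re-plan from the newly observed state.
# a_t = random_shooting_plan(s_t, dynamics_model, reward_fn, act_dim=6)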


Table 3
Comparison of reviewed typical learning methods.
Subsection Algorithm On/Off-policy Model-free/based Direct Gradient Guarantee of Convergence Key Paper

4.1 DAgger N/A Model-free ✓ ✓ [54]


4.2 REINFORCE On Model-free ✓ ✓ [59]
4.2 MCPG On Model-free ✓ ✓ [60]
4.2 NPG On Model-free ✓ ✓ [62]
4.2 TRPO On Model-free ✓ ✓ [63]
4.2 PPO On Model-free ✓ ✓ [64]
4.3 Tabular Q-learning Off Model-free ⨯ ✓ [66]
4.3 Fitted Q iteration Off Model-free ⨯ ⨯ [68]
4.3 DQN Off Model-free ⨯ ⨯ [52]
4.3 DDQN Off Model-free ⨯ ⨯ [70]
4.3 DDPG Off Model-free ✓ ✓ [75]
4.4 A2C On Model-free ✓ ✓ [79,80]
4.4 A3C On Model-free ✓ ✓ [82]
4.4 TD3 Off Model-free ✓ ✓ [83]
4.4 SQL Off Model-free ✓ ✓ [84]
4.4 SAC Off Model-free ✓ ✓ [85]
4.5 iLQR N/A Model-based ✓ ✓ [90]
4.5 iLQG N/A Model-based ✓ ✓ [96]
4.5 CEM N/A Model-based ✓ ✓ [87]
4.5 MCTS N/A Model-based ✓ ✓ [88]
4.5 GPS N/A Model-based ✓ ✓ [93]
4.5 MPC N/A Model-based ✓ ✓ [91]

4.6. Comparisons on characteristics

This sub-section delivers terse comparisons based on key factors amongst different robot learning paradigms, including the sample efficiency, the distinction between model-free and model-based robot learning, differences in gradient computing, algorithmic convergence, and the link between decision-making and control. Finally, the comparisons amongst typical methods are summarised in Table 3.

4.6.1. Sample efficiency

Sample efficiency indicates the utilization of the sample. Using a batch of samples only once and then dropping them is clearly less efficient than reusing them for later training epochs. Off-policy learning methods, which can use a replay buffer or sample database for re-training, have higher efficiency than pure on-policy learning methods. Offline learning methods, which can utilise data from prior episodes generated from the old policy, are more sample efficient than online learning methods.

4.6.2. Model utilization

The distinction between model-free and model-based learning methods is whether the environment model is contained in the backpropagation of the policy or not. The environment model indicating the dynamics of the targeted system is not embodied in model-free learning. However, the dynamic model can take the role of a simulator in model-free learning without changing the style of the model-free algorithm. In model-based robot learning, using a model in the backpropagation of the policy is the most direct way. On the other hand, if the decision-making is solved by planning through the fitted model instead of using the trained policy, it belongs to optimal control, optimal planning or trajectory optimization.

4.6.3. Gradient computing, guarantee of convergence and optimum

Policy gradient and actor-critic learning, which have the cumulated reward as the supervision signal on the policy training, are in the style of direct gradient computing. Meanwhile, value function learning, which utilises the value function to update the policy, is not direct gradient computing when deep neural networks are employed for value functions. Model-based robot learning by policy gradient and planning through the model is direct gradient computing. If a method is with direct gradient computing, it will hold the guarantee of convergence. This is because, when backpropagation with the error is focused, it follows the supervised learning fashion. An exception is Q-learning using Q tables. It also has a guarantee of convergence that can be proven mathematically. It should be pointed out that no guarantee of convergence does not mean the algorithm will not converge. As for local or global optimum, any machine learning method using deep neural networks will face the challenge of local optima. Only model-based optimal control or planning has the opportunity to reach the global optimum when the model is exact and can be solved analytically. What needs to be noted is that the properties of direct gradient computing, the guarantee of convergence and optimum cannot indicate that one method will certainly dominate another one. It needs to be assessed on a case-by-case basis.

4.6.4. Decision-making and control

First of all, a part of model-based learning methods can also be called optimal control in the automatic control field. Concretely, model-based learning emphasises the learning of the dynamic model. As for model-free learning methods, they can also be regarded as a control problem by thinking of the policy as the controller, which takes the observation from sensors and generates actions for an agent. In the community of robotics and automatic control, iterative learning control (ILC) [99] also takes the idea of utilising data from prior repetition. In all, if the decision-making runs at a high rate on the motor level and the motion level, it looks just like the continuous control of a robot. However, for task-level robot learning, decision-making and control on robotic devices vary. It is mainly divided by the rate of decision refreshment.
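The sample-efficiency contrast in Section 4.6.1 can be made concrete with a minimal replay buffer sketch; the capacity and batch size below are illustrative assumptions rather than values taken from any cited method.

# Minimal replay buffer: off-policy methods keep old transitions and re-sample them,
# whereas a pure on-policy method would use a freshly collected batch once and drop it.
import random

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.storage = []
        self.position = 0

    def add(self, transition):
        """transition = (state, action, reward, next_state, done)."""
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.position] = transition   # overwrite the oldest entry
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size=64):
        batch = random.sample(self.storage, batch_size)
        return list(zip(*batch))   # columns: states, actions, rewards, next_states, dones

Because sample() is called at every update, a transition collected once can contribute to many gradient steps, which is exactly the reuse that gives off-policy and offline methods their higher sample efficiency.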


5. Training tools and benchmarks

Robot learning, a branch of machine learning technology, can take advantage of any automatic differentiation tools, like TensorFlow or PyTorch. Moreover, it also requires physical robotic devices or simulators, which are not contained in standard machine learning.

Physical robotic devices for robot learning can be any well-designed robotic system, including industrial robots, collaborative robots, AGVs, or UAVs [58,65,76,77,92,96]. The advantage of utilising real physical robots is that all the data are from real devices. The disadvantage is also apparent. A deep reinforcement learning algorithm may require a large amount of data to achieve good enough performance on real robots [100]. Besides, robots need to be set back to the initial conditions again and again during the training, which requires quantities of preliminary engineering or manual power. Another issue is related to safety. The exploration of the policy may lead the robot to conditions where danger may happen to humans or the robot itself. Those dangers are often unacceptable.

Simulators were established to overcome these shortcomings and improve training efficiency. Calculations in simulators can run at a high rate on computers and even in parallel on computer clusters. Both model-free and model-based robot learning can be employed on simulators. Additionally, because of the strict kinematics and dynamics in robotics, the learned policy often can be easily generalised to real robots. Well-known simulators are listed in Fig. 15, such as OpenAI Gym [101], MuJoCo [102,103], Bullet [104,105] and RLBench utilising V-REP [106].

Fig. 15. Well-known simulators for robot learning. (a) and (b) are from OpenAI Gym, (c) is from [103], (d) and (f) are from Bullet, and (e) is from [105].

In production research and manufacturing technology, using the robotic simulator for the design validation and programming of robotic manufacturing systems has a long history. For instance, ABB launched RobotStudio in 1998, which is able to simulate various robotic devices on the shop floor today. KUKA also has a similar product called KUKA.Sim. In the robot operating system (ROS), Gazebo is another powerful tool for robotic simulation, which is further built with physics engines. However, due to the production lines of different vendors and their commercial targets, those industrial software tools were not initially oriented to robot learning. Nowadays, the idea of the digital twin has returned after its first proposal in 2002 [107,108]. Though it has different names in computer science and computational graphics and has been studied for decades, it also provides a promising future in smart robotic manufacturing by enhancing the capability towards the training requirements. Generating data at a large quantity and at a fast speed is pivotal for data-driven methods, like model-free robot learning. Digital twins for industrial scenarios bring great potential for future work.

A benchmark that can be commonly adopted is significant for a specific domain [109]. As for reinforcement learning, both OpenAI Gym and MuJoCo have their typical cases for benchmarking. Another famous benchmark is the Atari Games. A large number of milestone papers were validated on it. Nevertheless, there are few benchmarks for production and manufacturing research. One reason may be that robot learning has just started in manufacturing research, and the infrastructure needs to be completed. Another practical reason is that the tasks in production and manufacturing vary significantly. The same task, like assembly, also varies due to different products. Researchers who have different tasks shall face the challenge to compare their methods. More analysis and details will be concluded in the next section.
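To make the simulator-based training workflow concrete, the following minimal interaction loop uses the classic OpenAI Gym API [101] (the newer Gymnasium API differs slightly in its reset/step signatures). The environment id and the random action selection are placeholders: a robot learning method would register its own task environment, draw actions from its policy, and add an update step on the collected transitions.

# Minimal simulator interaction loop in the classic OpenAI Gym API [101].
import gym

env = gym.make("CartPole-v1")            # placeholder environment id
for episode in range(5):
    obs = env.reset()                    # resetting is cheap in simulation, unlike on physical robots
    done, episode_return = False, 0.0
    while not done:
        action = env.action_space.sample()           # placeholder for pi_theta(obs)
        obs, reward, done, info = env.step(action)   # the simulator advances one timestep
        episode_return += reward                     # an agent would store/learn from this transition
    print(f"episode {episode}: return = {episode_return}")
env.close()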


Table 4
Typical industrial applications on robot learning-driven robotic grasping.
Year | Task | Method | Robot | Observation | Action | Level
2019 | Pick and transport [43] | Supervised CNN and DQN | ABB Yumi | Depth image | Grasp action | Motion
2021 | Separate entangled workpieces [111] | DQN | A PANDA robot with a two-finger gripper | Depth image | Movement for separation | Task
2021 | Pick and move [112] | Model-based robot learning | 6 DoF robotic manipulator with a two-finger gripper | RGBD image | 6 DoF motion | Motion
2018 | Real-time hand-eye coordinated grasping [113] | Supervised CNN and model-based learning | A line of 14 industrial robotic manipulators, and a line of 8 KUKA iiwa | RGB image | 7 DoF motion | Motion
2020 | Pick and place [114] | DQN | UR robot | RGBD image | Grasp location | Task
2021 | Grasp for assembly [115] | Q-learning | UR5 | RGBD heatmap | Grasping position | Task
2021 | Pick by vacuum [116] | DQN | DOBOT | RGBD image | Spatial coordination | Task

Table 5
Typical industrial applications on robot learning-driven robotic assembly and disassembly.
Year | Task | Method | Robot | Observation | Action | Level
2019 | Steam cooker assembly [117] | Q-learning | UR10 with SCHUNK two-finger gripper | Results from human action and object recognition | Sub-procedure | Task
2015 | Torque converter assembly [118] | Model-based learning | ABB IRB4400 | Assembly performance | Assembly parameters | Task
2014 | Peg-in-hole assembly [119] | Model-based learning | ABB IRB120 and ABB IRB140 | Position in a grid | Four types of movements | Task
2013 | Peg-in-hole assembly [120] | Model-based learning | ABB IRB120 and ABB IRB140 | Position in a grid | Four types of movements | Task
2019 | Cranfield assembly benchmark [121] | Q-learning | PANDA | Labelled state of objects | Hierarchy of sub-procedure | Task
2019 | Toy car assembly [122] | Imitation learning and GMM | ABB IRB140 | Position and orientation of objects | Motion trajectory | Motion
2018 | Valve body assembly [123] | Value function | ABB IRB120 and ABB IRB140 | Three offsets and value | Six directions and values | Task
2020 | Square peg-in-hole assembly [124] | Imitation learning and DDPG | A 6-DOF industrial manipulator | Force and position | Desired position and velocity | Motion
2015 | Pick-and-place / assembly [125] | Imitation learning | KUKA LBR iiwa | Not given. | Low-level task profiles | Task
2019 | Case assembly [126] | Imitation learning and GMM | 2 UR robots | States of objects from a vision system | Sub-procedure and motion trajectory | Motion + Task
2019 | Circuit breaker assembly [127] | DQN | KUKA iiwa | Force and torque | Rotate along three axes | Motion
2018 | Peg-in-hole insertion [128] | GPS | UR5 | Force, torque and robot state | Reference to an admittance controller | Motion
2019 | Gear assembly [129] | iLQG | Sawyer robot | Force, torque and robot state | Reference to a force controller | Motion
2019 | Computer assembly [130] | TD3 | Mitsubishi MELFA RV-FR robot | Joint angles and angular velocities | Angular velocities | Motion
2021 | Peg-in-hole assembly [131] | Imitation learning and GMM | Elite EC75 | Force and torque | Threshold of compliance force | Task
2019 | E-waste unscrewing disassembly [132] | Q-learning | UR5 | Force, joint angle, and position of the end-effector | Seven types of movements | Task

6. Industrial applications

This section reviews recently published practical and typical papers under the scope of robot learning towards manufacturing or industrial robotic devices. Only papers with concrete case studies on industrial robots or published in the venues of production and manufacturing research were selected. We categorised those cases into four sub-sections: robotic grasping, assembly and disassembly, process control, and human-robot collaboration. Furthermore, we classified the listed applications into the motion level and the task level, considering the definition of the action. Only when the action space mirrors the joint space of a robot or the configuration of the tool in Cartesian space will the application be seen as on the motion level. Other applications, which have actions like an order of pre-defined points or coordinates in Cartesian space, were regarded to be on the task level since they are not tightly linked to the kinematics of any robot.

6.1. Robotic grasping

Grasping objects precisely is an indispensable capability of robots, especially the industrial robots on the shop floor. Today, it is still a significant and tricky task that several works focus on [110]. Typical industrial applications are listed in Table 4. Mahler et al. [43] proposed a CNN and DQN based method for robotic pick and transport on the ABB Yumi robot. Moosmann et al. [111] studied a DQN based separating task on a PANDA robot. Berscheid et al. [112] utilised model-based learning, taking RGBD images as the observation for a pick and move task. Levine et al. [113] studied learning-based real-time robotic grasping on a vast scale. Their case studies were conducted on two robotic lines, one with 14 robots and the other with eight robots. Mohammed et al. [114] and Wang et al. [115] employed DQN and Q-learning on UR industrial robots. They took the RGBD information as the input and generated the target position from the learned policy. Yao et al. [116] also investigated a similar method but on a robot named DOBOT.

From the listed applications, we can see that most robot learning-driven grasping actions take vision signals as the observation. It requires the learning model to have a powerful capability of representation. Hence, most of the works are based on deep reinforcement learning or instantiated deep neural networks like CNN to handle the images. The result reflects the mighty strength of robot learning-based grasping. By this means, the robot can pick objects in dense clusters with a favourable success rate. Methods like Moosmann et al. [111] show great potential in complex industrial scenarios, where the objects are often entangled with each other. Some of the works did not indicate a manufacturing task but showed a clear and promising combination with manufacturing scenarios in the future.

6.2. Robotic assembly and disassembly

Assembly is the key process of modern manufacturing and production. Robot learning was also explored to enhance the performance of the assembly. Typical industrial applications are listed in Table 5. Akkaladevi et al. [117] employed tabular Q-learning with the reward signal from the human collaborator. Chen et al. [118] utilised a model-based robot learning method with Gaussian process regression and Bayesian optimization. It updates the model based on prior assembly performance and then uses the model to tune seven assembly parameters. Cheng et al. [119,120] discretised the observation space into a grid and the action space into four movements, and modelled the decision-making problem as a partially observable Markov decision process. De Winter et al. [121] utilised Q-learning and a hierarchical task graph for robotic assembly. Duque et al. [122] proposed imitation learning and GMM based robot learning for trajectory generation in a toy car assembly task. Hong et al. [123] investigated value function based learning for valve body assembly. In this study, the observation space is the offset conducted from the robot vision system, while the action space is the direction and distance of the end effector in Cartesian space. Kim et al. [124] studied robot learning for square peg-in-hole assembly. It linked imitation learning and DDPG together, which is unusual in robot learning. Ko et al. [125] also put up the idea that imitation learning can help robotic assembly. Kyrarini et al. [126] proposed a hybrid robot learning for robotic assembly combining learning on the motion level and the task level. Li et al. [127] utilised DQN and ROS Gazebo simulators to train a KUKA iiwa robot to accomplish a circuit breaker assembly task. Luo et al. [128,129] focused on contact-rich robotic assembly by taking force and torque into the observation and tried model-based GPS and iLQG methods. Ota et al. [130] studied the motion trajectory in an unknown environment using TD3 with RRT as the reference and tested their method with a computer assembly task. Song et al. [131] investigated imitation learning with GMM on a peg-in-hole assembly. They utilised GMM as an estimator of a force threshold before the inverse kinematic solver, rather than learning a direct policy. The controller they used is still a PID controller, making the learning on the task level. Robotic disassembly stresses the inverse process of assembly. Kristensen et al. [132] studied a robot simulation framework for e-waste disassembly with Q-learning.

From the listed applications, we can see various robot learning methods have been tried on robotic assembly, from model-based to model-free ones. Those methods were initiated on various industrial robots from different vendors and systems. Due to the restriction from


the vendor, it is rare to find motor level robot learning in industrial settings because of the closed wrapper on the industrial controllers. For contact-rich assembly tasks, force and torque are common observations with the perception from torque sensors. For assembly planning, action spaces are often reflected in sub-procedures or hierarchies. Recently, more papers stressed deep robot learning methods like DQN, DDPG and TD3, as well as the employment of simulators for training. It needs to be indicated that papers in disassembly are rather fewer than in assembly, which shows a significant gap between robotics and manufacturing research.

6.3. Robotic process control

Apart from robotic grasping and assembly, robot learning was also applied in other processes in manufacturing and production. Typical industrial applications are listed in Table 6. Andersen et al. [133] proposed a DDPG-driven robotic injection process with a KUKA industrial robot. Brito et al. [134] put up the idea of using robot learning in product quality inspection. Duguleana et al. [135] took the passive marker on the human hand as the observation and deployed Q-learning-based obstacle avoidance in the simulator and real environment. Fu et al. [136] focused on deformable object manipulation by industrial robots with DDPG in the garment manufacturing industry. Hameed et al. [137] investigated robot learning of task scheduling in a manufacturing cell containing multiple robotic manipulators. They located workpieces by embedded RFID tags and then learned how to arrange sub-procedures. Imtiaz et al. [138] studied pick and place using Q-learning on the conveyer without robot vision. By doing so, they have to pre-define a series of positions and poses and then learn a policy to select among them. The motion planning of the robot is produced by other motion planners. Jaradat et al. [139] deployed tabular Q-learning for a mobile robot navigation task. This work was published earlier, when deep reinforcement learning was not well-established. It defined observation and action spaces by pre-defined alternatives and solved the navigation on the task level. Sichkar [140] also investigated Q-learning based mobile robot path planning in a discrete map. Liu et al. [141] proposed a training framework with the background in cloud manufacturing. It showed that robot learning technology has the potential to be implemented by all means for the industry, such as on the cloud. Luo et al. [142] employed imitation learning for a robotic teleoperation task with the collected trajectories and muscle stiffness from a remote interface. Maldonado-Ramirez et al. [143] focused on vision guided path-following in robotic welding using PPO. Mueller et al. [144] used imitation learning to teach a Sawyer robot to conduct a pouring task. Tsai et al. [145] embodied Q-learning into a shoe manufacturing line and used it for scheduling. It also employed other deep learning methods like CNN to process the image of shoe tongues. Wang et al. [146] proposed more comprehensive research using PPO for a mobile manipulation task. The learned policy can generate actions for the robot, the mobile platform and the binary gripper.

Table 6
Typical industrial applications on robot learning-driven robotic process control.
Year | Task | Method | Robot | Observation | Action | Level
2019 | Brine injection [133] | DDPG | KUKA LWR 4 | Mass of the object | Injection pressure and time | Task
2020 | Quality inspection [134] | AC | UR3 | Current positions | Next positions | Task
2012 | Obstacle avoidance [135] | Q-learning | PowerBot with PowerCube arm | Passive markers | Arm movements | Motion
2021 | Fabric manipulation [136] | DDPG with NAF | UR5 | State of the robot | 4 movements | Task
2020 | Scheduling in robotic manufacturing cell [137] | Imitation learning | Adept Cobra i600 SCARA and ABB IRB1400 | State of work-piece | Sub-procedure | Task
2021 | Pick and place without vision [138] | Q-learning | JACO | Coordinates, type of sub-procedure, grasping pose, speed, position of object, and path | Coordinates, type of sub-procedure, grasping pose, and path | Task
2011 | Mobile robot navigation [139] | Q-learning | A mobile robot | 4 pre-defined regions | 3 pre-defined movements | Task
2019 | Path planning for mobile robot [140] | Q-learning | A mobile robot | A tabular map | 4 pre-defined actions | Task
2020 | Robot training in cloud manufacturing [141] | DQN | UR5 | RGBD image | Coordinates | Task
2019 | Teleoperation [142] | Imitation learning | Baxter robot | Data from Touch X interface | Motion trajectory | Motion
2021 | Visual path-following in welding [143] | PPO | KUKA KR16HW | RGB image | Movements on axes and angle | Motion
2018 | Pouring [144] | Imitation learning | Sawyer robot | Not given. | Motion trajectory | Motion
2020 | Soft fabric shoe tongues automation [145] | Q-learning | An industrial robot | Procedural item of the production line | Scheduling rule | Task
2020 | Mobile manipulation [146] | PPO | 2 UR5 on a Husky mobile platform | Position of the gripper and object, the joint position and velocity of the robot, and the gripper state | Control actions for end-effector, mobile platform and gripper | Motion + Task
2015 | Contour tracking [147] | Model-based learning | FANUC M-16iB | Robot motor side encoders | Torque and motor reference compensation | Motor
2020 | 3D motion imitation [148] | Imitation learning | UR robot | Key points | Position and velocity | Motion
2018 | Learn stiffness regulation strategies from humans [149] | Imitation learning | Baxter robot | sEMG signals | Motion trajectory and stiffness profile | Motor + Motion
2021 | Robotic ultrasound scanning [150] | Imitation learning | Baxter robot | Force and trajectory | Force and trajectory | Motor + Motion
2017 | Cooperative handling [152] | Q-learning | Adept Cobra i600 SCARA-robot and ABB IRB1400 | State matrix | Pre-defined movements | Task
2019 | Cooperative handling [151] | AC | Adept Cobra i600 SCARA-robot and ABB IRB1400 | State matrix | Pre-defined movements | Task


Wang et al. [147] used model-based learning for a contour tracking task, but on the motor level. They utilised a FANUC industrial robot whose motor control is available for torque and motor reference compensation. Ye et al. [148] investigated imitation learning-based 3D motion imitation with key points in Cartesian space indicated by a human. Zeng et al. [149] proposed an imitation learning method for a stiffness regulation task. They also took the sEMG signals from the human muscle and trained the Baxter robot to cope with the human. Zhang et al. [150] studied force and motion control of a robotic ultrasonic scanning task. Imitation learning was employed on the motor and motion levels. Schwung et al. [151,152] studied Q-learning and actor-critic learning for robot-robot cooperative handling tasks.

From the listed applications, we can summarise that robot learning has landed in the domain of robotic process control, and many of the applications are targeted towards production and manufacturing. Unique manipulations like injection, scanning, obstacle avoidance, fabric manipulation, pouring and path-tracking in welding witnessed the help of robot learning and showed promising potential to be integrated with modern robotic production systems. Like the applications in robotic assembly, most of the innovations were published within three years. Methods with deep representation like DQN, DDPG and PPO are continuously growing and contributing to the community.

6.4. Industrial human-robot collaboration

Human-robot collaboration is an innovative production paradigm with the background of Industry 4.0 and Industry 5.0. It embodies the perception of the robot, the human, and the product together and makes decisions on the planning and control of the whole system [16]. Here, robot learning also helps in the collaborative system, taking the coexistence of the human and robot into account. Typical industrial applications are listed in Table 7. Liu et al. [25] focused on the safety issue in industrial human-robot collaboration and deployed DDPG into it. It took the obstacle of the human arm as the observation and generated the joint angles of an ABB IRB1200 robot for real-time trajectory planning. Liu et al. [153] took the dynamic and stochastic properties of human behaviour in the assembly into consideration and utilised DQN to drive the robot to collaborate with the human on the task level. Meng et al. [154] studied a PPO-based variable impedance control for a cuboid assembly task where the product was handled by both the human and the robot. Oliff et al. [155] employed DQN for a task planning task in a tripolar production line. It listed seven key data points of such a production line and used DQN to optimise factors like idle time and productivity. Rahman et al. [156] used a trust-based MPC to handle the task allocation for human-robot collaborative assembly in manufacturing. The MPC was used to minimise the variation between human and robot speed and maximise the trust value. Rozo et al. [157] considered collaborative lifting, transportation and assembly tasks in human-robot collaboration. They initiated an imitation learning-based method to generate robot motion. Sun et al. [158] employed a CNN based on MobileNet to generate pre-defined actions. The learned network took dual image input from cameras and utilised human knowledge for supervision. Wang et al. [159] studied a human-robot collaborative assembly task with imitation learning. Inverse reinforcement learning was also explored in this research. Wang et al. [160] firstly used multimodal processing to estimate ten subclasses of hand-over intentions and then took them as the observation. The robot was trained to make decisions on accepting the delivery from the human or not. On the path generation of a collaborative robot from human demonstration, Wang et al. [161] proposed a complex method embodying imitation learning, Q-learning, and a simulated annealing algorithm, which is special in the domain. Yu et al. [162,163] imitated the pipeline of AlphaGo and used a chessboard image to represent the task states. They utilised CNN, MCTS and DQN to solve the task planning issue. Zhang et al. [164] investigated the task allocation problem in human-robot collaborative assembly using DDPG.

From the listed applications, we can discover that robot learning has contributed to both the motion level and the task level in industrial human-robot collaboration. On the motion level, force contact and stiffness are often considered. They are usually linked to other controllers like the impedance controller. There is also research on direct learning of robot motion, which generates the sequential joint angles. More studies occur on the level of task allocation and planning. Some focused on the dynamic and stochastic choices from the human, while some stressed the allocation of sub-procedures. Though continuous algorithms like DDPG were deployed, most of the research on the task level takes discrete representation as the action space. On the other hand, human-robot collaboration in disassembly [16] is also an interesting topic, especially when sustainability is considered. However, few papers applied robot learning, which also indicates an opportunity for future work.

Table 7
Typical industrial applications on robot learning-driven industrial human-robot collaboration.
Year | Task | Method | Robot | Observation | Action | Level
2021 | Safe interaction [25] | DDPG | ABB IRB1200 | 3D obstacle | Joint angles | Motion
2021 | Dynamic and stochastic planning [153] | DQN | Robot in a bolt-type roller and gear system assembly task | Task vector | Sub-procedure | Task
2021 | Collaborative cuboid assembly [154] | PPO | UR robot | Positions, velocities, forces, quaternion of the gripper and object | Impedance parameters | Motion
2020 | Optimization in a tripolar production line [155] | DQN | Not given. | 7 data points | 6 pre-defined actions | Task
2015 | Subtask allocation in collaborative assembly [156] | MPC | Baxter robot | Trust value | Input speed for the robot | Motion
2016 | Lifting, transportation and assembly [157] | Imitation learning | WAM robot and KUKA lightweight robot | A set of task parameters (a number of coordinate systems) | Acceleration | Motion
2020 | Collaborative toy car assembly [158] | Imitation learning | Staubli robot | RGB images | 8 pre-defined actions | Task
2018 | Cylinder assembly [159] | Imitation learning | Staubli robot | 12 assembly task states | 12 pre-defined optional actions | Task
2021 | Hand-over collaboration [160] | Imitation learning | Staubli robot | 10 subclasses of hand-over intentions | Accept or not | Task
2021 | Path generation [161] | Imitation learning and Q-learning | UR5 | Search parameters | 4 pre-defined movements | Task
2020 | Collaborative assembly [162] | MCTS | Robot in a desk assembly task | Task chessboard | Sub-procedure | Task
2021 | Collaborative assembly [163] | DQN | Robot in a desk assembly task | Task chessboard | Sub-procedure | Task
2022 | Collaborative assembly [164] | DDPG | Robot in an assembly task | Discrete states of a robot, human, and execution progress | Selection of the target parts | Task


7. Open problems and research directions

This section summarises the open problems in robot learning and the possible research directions for robot learning-driven methods in the next generation of smart robotic manufacturing.

7.1. Visuomotor control and planning

Robot vision is one of the most direct perception methods for robotic applications. Visuomotor control and planning take the real-time images from the robotic manufacturing cell and make decisions. However, how to train robots integrated with images that represent a higher and richer dimension, and how to process images with robust and real-time capability, need to be further explored with consideration of physical robotic manufacturing settings in the real world [165].

7.2. Stochastic and unstructured industrial scenarios

Stochastic conditions are inevitable in unstructured environments, like complex industrial scenarios. How to estimate uncertainties and identify unknowns is still challenging [166]. Advanced production paradigms like industrial human-robot collaboration, which free robots from fences and restricted manufacturing cells, face more uncertainties than conventional production methods. Learning to cope with stochastic states from the human and product remains open and challenging.

7.3. Dimension of observation

In robot learning, the dimension of observation varies with the object to be observed. Sometimes, various objects need to be observed simultaneously [167]. For manufacturing applications, how to choose a proper representation (i.e. RGB images, depth images, vector of a task, graphs, etc.) for observation and how to integrate it with the learning algorithms is another tricky issue. For different manufacturing tasks, it requires domain knowledge and consideration.

7.4. Industrial training systems and benchmarks

Apart from benchmarks from computer science like the Atari Games, the industrial society expects universal training systems and benchmarks which can replay the manufacturing task precisely. Not only do accurate and specific tasks need to be contained in those systems, but the training APIs also need to be established. Open-source training systems and benchmarks are common in computer science. In contrast, the manufacturing research society needs more effort on clarity and sustainability.

7.5. Sim-to-real gap: digital twins

Training robots in simulators is easy and effective. However, if the simulator fails to represent the ground truth of the robotic task, a distribution shift will form, leading to failure in the physical world even though the trained robot performs well in the cyber world. This is also called the sim-to-real gap in robot learning. Fortunately, manufacturing research is now adopting digital twin technology [108]. Minimising the sim-to-real gap by digital twins is also an open area.

7.6. Multi-agent multi-task robot learning

When standing at a higher perspective of robotic manufacturing systems, multiple robots and tasks are possible to be processed together. It is called multi-agent multi-task robot learning in the society of artificial intelligence. Recently, this method showed great capability in e-sports, where multiple agents were trained to perform as a virtual army while having respective tasks [168]. Deploying multi-agent multi-task robot learning in robotic manufacturing factories is open and challenging.

7.7. Transfer learning and knowledge sharing

In robotic manufacturing lines and shop floors, there exists a mass of robotic equipment. If the policy learned by one robot can be transferred to the others, time and energy can be saved. Furthermore, the knowledge of a single robot can possibly be shared amongst networks, such as industrial cloud robotics [169]. Transfer learning, meta learning and federated learning (a.k.a. collaborative learning) are common architectures in machine learning. From the micro view, transfer learning and meta learning are promising to reuse data from other robots and tasks, while federated learning is able to handle the privacy preservation on knowledge sharing amongst different robots and end-users, or even manufacturing enterprises and IT infrastructure suppliers from a macro perspective. However, how to establish the knowledge sharing architecture, how to manage the knowledge learned from different robots, and how to justify which part of the knowledge is acceptable are quite challenging [170]. How to use the knowledge learned from a prior task in a new one [171] is open to researchers, especially those in the field of smart robotic manufacturing.

7.8. Reward shaping and inverse robot learning

The reward signal is the vital function that participates in constructing the loss or return function. Hence, the reward function needs to be shaped to fit the actual manufacturing tasks. On the other hand, inverse reinforcement learning (IRL) is a technology to figure out the goal of the task instead of copying the demonstration [172]. Precise reward shaping and IRL towards smart robotic manufacturing are still worth studying.

7.9. Hardware and system integration

In addition to industrial manipulators, robot learning on other robotic systems towards smart robotic manufacturing, like AGVs and UAVs, remains open. How to generalise robot learning to more categories of robotic systems in manufacturing requires more investigation. Though robot learning can be used for multiple robotic systems, integrating learning systems, sensors, and robots is still tricky. In computer science, one algorithm can be easily transferred to another computing platform. However, improving the tractability of robot learning in the manufacturing field expects engineering innovation on methods and tools.

7.10. Concerns towards manufacturing application: standards and scenarios

No matter how big or small, enterprises are the destination for robot learning to be applied. Currently, most of the intelligent algorithms of robot learning stay in laboratories. How to wrap algorithms into solutions, how to offer enterprises proper solutions, and how to deploy them in manufacturing applications require efforts in international and industrial standards. On the other hand, more industrial robotic scenarios, like robotic polishing and painting, are worthy of being explored with robot learning.

8. Final remarks

Robots in manufacturing have been developed from nonprogrammable to programmable. For the next generation of smart robotic manufacturing, the adaptation to uncertainties has been pinpointed as an urgent requirement, which can also be named program-free robots. Robot learning, like the ability of creatures, is increasingly expected. The reason why reinforcement learning is so charming is the reward mechanism behind it, which imitates the survival principles of universal creatures on our planet. Learning by reward has shown significant capability of robotic control, games, planning and recommendation


systems while having a unique link to optimal control and planning research. Today, smart robotic equipment in manufacturing has taken a role on the stage. Towards more dynamic and stochastic tasks, the robotic devices are expected to learn and act like biological beings. Robot learning is exactly the field that makes robots learn and grow. From this review, we can conclude that robot learning in smart robotic manufacturing is a research field that was just started. Though we have worked and searched throughout to make this review as complete as possible, omissions are still inevitable due to the limitation of our knowledge. We leave it to the community for future work.

Declaration of Competing Interest

All the authors have no conflict of interest.

Acknowledgement

This work was supported by National Natural Science Foundation of China (Grant No. 51775399), Hubei Provincial Natural Science Foundation of China (Grant No. 2021CFA077), the Young Top-notch Talent Cultivation Program of Hubei Province, and the KTH-CSC programme of China Scholarship Council and KTH Royal Institute of Technology (No. 201906950003). We sincerely appreciate the editors and the anonymous reviewers for their valuable work.

References

[1] A. Kusiak, Smart manufacturing, Int. J. Prod. Res. 56 (2018) 508–517, https://doi.org/10.1080/00207543.2017.1351644.
[2] H.S. Kang, J.Y. Lee, S. Choi, H. Kim, J.H. Park, J.Y. Son, et al., Smart manufacturing: past research, present findings, and future directions, Int. J. Precision Eng. Manuf.-Green Technol. 3 (2016) 111–128, https://doi.org/10.1007/s40684-016-0015-5.
[3] L. Wang, From intelligence science to intelligent manufacturing, Engineering 5 (2019) 615–618, https://doi.org/10.1016/j.eng.2019.04.011.
[4] L.P. Kaelbling, The foundation of efficient robot learning, Science 369 (2020) 915–916, https://doi.org/10.1126/science.aaz7597.
[5] J. Cui, J. Trinkle, Toward next-generation learned robot manipulation, Sci. Robot. 6 (2021), https://doi.org/10.1126/scirobotics.abd9461.
[6] O. Kroemer, S. Niekum, G. Konidaris, A review of robot learning for manipulation: challenges, representations, and algorithms, J. Mach. Learn. Res. 22 (2021), 30:31-30:82.
[7] Z. Xie, Q. Zhang, Z. Jiang, H. Liu, Robot learning from demonstration for path planning: a review, Sci. China: Technol. Sci. 63 (2020) 1–10, https://doi.org/10.1007/s11431-020-1648-4.
[8] H. Ravichandar, A.S. Polydoros, S. Chernova, A. Billard, Recent advances in robot learning from demonstration, Ann. Rev. Control, Robot. Autonomous Syst. 3 (2020) 297–330, https://doi.org/10.1146/annurev-control-100819-063206.
[9] W. Wang, Y. Chen, R. Li, Y. Jia, Learning and comfort in human–robot interaction: a review, Appl. Sci. 9 (2019) 5152, https://doi.org/10.3390/app9235152.
[10] Z. Zhu, H. Hu, Robot learning from demonstration in robotic assembly: a survey, Robotics 7 (2018) 17, https://doi.org/10.3390/robotics7020017.
[11] T. Sebastian, B. Wolfram, F. Dieter, Probabilistic Robotics, MIT Press, 2005.
[12] J. Wallén, The History of the Industrial Robot, Linköping University Electronic Press, 2008.
[13] V. Digani, L. Sabattini, C. Secchi, C. Fantuzzi, Ensemble coordination approach in multi-agv systems applied to industrial warehouses, IEEE Trans. Autom. Sci. Eng. 12 (2015) 922–934, https://doi.org/10.1109/TASE.2015.2446614.
[14] F. Guérin, F. Guinand, J.F. Brethé, H. Pelvillain, UAV-UGV cooperation for objects transportation in an industrial area, in: 2015 IEEE International Conference on Industrial Technology (ICIT), 2015.
[15] J. Nikolic, M. Burri, J. Rehder, S. Leutenegger, C. Huerzeler, R. Siegwart, A UAV system for inspection of industrial facilities, in: 2013 IEEE Aerospace Conference, 2013.
[16] Q. Liu, Z. Liu, W. Xu, Q. Tang, Z. Zhou, D.T. Pham, Human-robot collaboration in disassembly for sustainable manufacturing, Int. J. Prod. Res. 57 (2019) 4027–4044, https://doi.org/10.1080/00207543.2019.1578906.
[17] Y. Lu, X. Xu, L. Wang, Smart manufacturing process and system automation–a critical review of the standards and envisioned scenarios, J. Manuf. Syst. 56 (2020) 312–325, https://doi.org/10.1016/j.jmsy.2020.06.010.
[18] Y. Qu, X. Ming, Z. Liu, X. Zhang, Z. Hou, Smart manufacturing systems: state of the art and future trends, Int. J. Adv. Manuf. Technol. 103 (2019) 3751–3768, https://doi.org/10.1007/s00170-019-03754-7.
[19] L. Wang, R. Gao, J. Váncza, J. Krüger, X.V. Wang, S. Makris, et al., Symbiotic human-robot collaborative assembly, CIRP Ann. Manuf. Technol. 68 (2019) 701–726, https://doi.org/10.1016/j.cirp.2019.05.002.
[20] W. Wang, Y. Chen, Y. Jia, Evaluation and optimization of dual-arm robot path planning for human–robot collaborative tasks in smart manufacturing contexts, ASME Lett. Dyn. Syst. Control 1 (2021), 011012, https://doi.org/10.1115/1.4046577.
[21] S. Liu, L. Wang, X.V. Wang, Sensorless haptic control for human-robot collaborative assembly, CIRP J. Manuf. Sci. Technol. 32 (2021) 132–144, https://doi.org/10.1016/j.cirpj.2020.11.015.
[22] A. Al-Yacoub, Y. Zhao, W. Eaton, Y. Goh, N. Lohse, Improving human robot collaboration through force/torque based learning for object manipulation, Rob. Comput. Integr. Manuf. 69 (2021), 102111, https://doi.org/10.1016/j.rcim.2020.102111.
[23] S. Liu, L. Wang, X.V. Wang, Sensorless force estimation for industrial robots using disturbance observer and neural learning of friction approximation, Rob. Comput. Integr. Manuf. 71 (2021), 102168, https://doi.org/10.1016/j.rcim.2021.102168.
[24] Z. Liu, X. Wang, Y. Cai, W. Xu, Q. Liu, Z. Zhou, et al., Dynamic risk assessment and active response strategy for industrial human-robot collaboration, Comput. Ind. Eng. 141 (2020), 106302, https://doi.org/10.1016/j.cie.2020.106302.
[25] Q. Liu, Z. Liu, B. Xiong, W. Xu, Y. Liu, Deep reinforcement learning-based safe interaction for industrial human-robot collaboration using intrinsic reward function, Adv. Eng. Inf. 49 (2021), 101360, https://doi.org/10.1016/j.aei.2021.101360.
[26] Z. Liu, Q. Liu, W. Xu, Z. Zhou, D.T. Pham, Human-robot collaborative manufacturing using cooperative game: framework and implementation, Procedia CIRP 72 (2018) 87–92, https://doi.org/10.1016/j.procir.2018.03.172.
[27] National Research Council, Frontiers in Massive Data Analysis, National Academies Press, 2013.
[28] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (1998) 2278–2324, https://doi.org/10.1109/5.726791.
[29] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning Internal Representations by Error Propagation, California University San Diego La Jolla Institute for Cognitive Science, 1985.
[30] F. Scarselli, M. Gori, A.C. Tsoi, M. Hagenbuchner, G. Monfardini, The graph neural network model, IEEE Trans. Neural Networks 20 (2008) 61–80, https://doi.org/10.1109/TNN.2008.2005605.
[31] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444, https://doi.org/10.1038/nature14539.
[32] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Netw. 2 (1989) 359–366, https://doi.org/10.1016/0893-6080(89)90020-8.
[33] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, et al., Variational autoencoder for deep learning of images, labels and captions, in: 30th Annual Conference on Neural Information Processing Systems (NeurIPS), 2016.
[34] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, et al., Generative adversarial nets, in: 28th Annual Conference on Neural Information Processing Systems (NeurIPS), 2014.
[35] D. Silver, S. Singh, D. Precup, R.S. Sutton, Reward is enough, Artif. Intell. 299 (2021), 103535, https://doi.org/10.1016/j.artint.2021.103535.
[36] L. Wang, S. Liu, H. Liu, X.V. Wang, Overview of human-robot collaboration in manufacturing, in: Proceedings of 5th International Conference on the Industry 4.0 Model for Advanced Manufacturing, 2020.
[37] K. Morik, M. Kaiser, V. Klingspor, Making Robots Smarter: Combining Sensing and Action Through Robot Learning, Springer Science & Business Media, New York, 2012.
[38] J. Peters, D.D. Lee, J. Kober, D. Nguyen-Tuong, J.A. Bagnell, S. Schaal, Robot learning, in: Springer Handbook of Robotics, Springer, 2016, pp. 357–398.
[39] A. Cully, J. Clune, D. Tarapore, J.B. Mouret, Robots that can adapt like animals, Nature 521 (2015) 503–507, https://doi.org/10.1038/nature14422.
[40] I. Rahwan, M. Cebrian, N. Obradovich, J. Bongard, J.F. Bonnefon, C. Breazeal, et al., Machine behaviour, Nature 568 (2019) 477–486, https://doi.org/10.1038/s41586-019-1138-y.
[41] A. Billard, D. Kragic, Trends and challenges in robot manipulation, Science 364 (2019), https://doi.org/10.1126/science.aat8414.
[42] G. Cheng, K. Ramirez-Amaro, M. Beetz, Y. Kuniyoshi, Purposive learning: robot reasoning about the meanings of human activities, Sci. Robot. 4 (2019), https://doi.org/10.1126/scirobotics.aav1530.
[43] J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, et al., Learning ambidextrous robot grasping policies, Sci. Robot. 4 (2019), https://doi.org/10.1126/scirobotics.aau4984.
[44] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, et al., Learning agile and dynamic motor skills for legged robots, Sci. Robot. 4 (2019), https://doi.org/10.1126/scirobotics.aau5872.
[45] J. Ichnowski, Y. Avigal, V. Satish, K. Goldberg, Deep learning can accelerate grasp-optimized motion planning, Sci. Robot. 5 (2020), https://doi.org/10.1126/scirobotics.abd7710.
[46] S. Sundaram, Robots learn to identify objects by feeling, Sci. Robot. 5 (2020), https://doi.org/10.1126/scirobotics.abf1502.
[47] S. Chernova, A.L. Thomaz, Robot learning from human teachers, Synth. Lect. Artif. Intell. Mach. Learn. 8 (2014) 1–121, https://doi.org/10.2200/S00568ED1V01Y201402AIM028.
[48] D.O. Won, K.R. Muller, S.W. Lee, An adaptive deep reinforcement learning framework enables curling robots with human-like performance in real-world conditions, Sci. Robot. 5 (2020), https://doi.org/10.1126/scirobotics.abb9764.
[49] M. Lazaro-Gredilla, D.H. Lin, J.S. Guntupalli, D. George, Beyond imitation: zero-shot task transfer on robots by learning concepts as cognitive programs, Sci. Robot. 4 (2019), https://doi.org/10.1126/scirobotics.aav3150.
