
2019 International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM)

Reinforcement Learning Algorithms in Global Path Planning for Mobile Robot
Valentyn N. Sichkar
Department of Control Systems and Robotics
ITMO University
Saint-Petersburg, Russia
[email protected]

Abstract—The paper is devoted to the research of two approaches to global path planning for mobile robots, based on the Q-Learning and Sarsa algorithms. The study was carried out with different adjustments of the two algorithms that made it possible to learn faster. The implementation of the two Reinforcement Learning algorithms showed differences in learning time and in the way the path is built to avoid obstacles and reach the destination point. The analysis of the obtained results made it possible to select optimal parameters of the considered algorithms for the tested environments. Experiments were performed in virtual environments where the algorithms learned which steps to choose in order to get a maximum payoff and reach the goal while avoiding obstacles.

Keywords—reinforcement learning, Q-Learning algorithm, Sarsa algorithm, path planning, mobile agent

I. INTRODUCTION

Reinforcement Learning represents a class of tasks in which a mobile robot (considered in this study as a mobile agent), acting in a particular environment, must find an optimal strategy for interacting with it. Popular methods for solving such problems are the Q-Learning and Sarsa algorithms. For training, the mobile agent is given information only in the form of a reward that has a certain quantitative value for each transition of the agent from one state to another (from one point to another). No other additional information is provided to train the agent. The most important feature of the Q-Learning and Sarsa algorithms is that they can be used even when the mobile agent has no prior knowledge of the environment.

While Reinforcement Learning algorithms are working, an estimation function over state-action pairs is constructed. In the standard view, this function is represented as a table whose inputs are these state-action pairs. One of the conditions for convergence of the algorithm, when a table representation of the function is used, is repeated testing of all possible state-action pairs while searching for the optimal path in a virtual environment with obstacles. The goal of the mobile agent is to find a behaviour policy that maximizes the expected amount of reward. The algorithms demonstrate the ability of Reinforcement Learning when the mobile agent knows nothing about the environment and learns the optimal behaviour for which the reward is maximal, where the reward is awarded not immediately for a single action but for a sequence of actions. This is what this study is devoted to, based on the Q-Learning algorithm and its Sarsa modification.

II. THE Q-LEARNING ALGORITHM AND ITS SARSA MODIFICATION

The task of Reinforcement Learning in its general form is formulated as follows. For each transition of the mobile agent from one state to another, a scalar value called a reward is assigned. The agent receives the reward for making the transition. The goal is to find the actions that maximize the expected amount of reward.

To accomplish this goal, the Q-Learning algorithm uses a Q-function whose argument is the action performed by the agent [1]. This allows the Q-function to be built iteratively and thereby the optimal control policy to be found. The expression for updating the Q-function is as follows:

    Q(x_t, a_t) = r_t + γ max_a Q(x_{t+1}, a)                                    (1)

where r_t is the reward received when the system moves from state x_t to state x_{t+1}, γ is the discount factor whose values range from 0 to 1, and a_t is the action selected at time t from the set of all possible actions.

Q-value estimates are stored in a two-dimensional table whose inputs are state and action. Equation (1) is usually combined with a temporal difference method [2]. With the parameter of the temporal difference method equal to zero, only the current and subsequent predictions of the Q-values are involved in the update. In this case, the method is called one-step Q-Learning. The expression for one-step Q-Learning is as follows:

    Q(x_t, a_t) ← Q(x_t, a_t) + α (r_t + γ max_a Q(x_{t+1}, a) − Q(x_t, a_t))    (2)

where α is the learning rate.

By analysing equation (1), it can be concluded that using the maximum to estimate the next action is not the best solution. In the early stages of learning, the Q-value table contains estimates that are far from ideal, and even in the later stages, using the maximum can lead to overestimation of Q-values. In addition, the update rule of the Q-Learning algorithm in combination with the temporal difference algorithm requires the temporal difference parameter to be zero when actions are chosen according to non-greedy policies. In this case, a non-greedy policy is a policy in which actions are selected with a certain probability depending on the value of the Q-function for a given state, unlike a greedy policy, in which the actions with the highest Q-value are selected. These disadvantages led to a modification of the Q-Learning algorithm called Sarsa (State-Action-Reward-State-Action). The main difference between this algorithm and the classical one is that the max operator is removed from the Q-value update rule. As a result, it is guaranteed that the temporal difference error is calculated correctly regardless of whether actions are chosen according to the greedy policy or not.
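To make the update rules concrete, the following Python sketch shows the one-step Q-Learning update of equation (2) next to the corresponding Sarsa update, in which the max operator is replaced by the Q-value of the action actually selected. The tabular layout and variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def q_learning_update(Q, x_t, a_t, r_t, x_next, alpha=0.5, gamma=0.99):
    """One-step Q-Learning update, equation (2): the target uses the
    maximum Q-value over the actions of the next state (off-policy)."""
    td_target = r_t + gamma * np.max(Q[x_next])
    Q[x_t, a_t] += alpha * (td_target - Q[x_t, a_t])

def sarsa_update(Q, x_t, a_t, r_t, x_next, a_next, alpha=0.5, gamma=0.99):
    """Sarsa update: the max operator is removed and the Q-value of the
    action actually selected in the next state is used (on-policy)."""
    td_target = r_t + gamma * Q[x_next, a_next]
    Q[x_t, a_t] += alpha * (td_target - Q[x_t, a_t])

# Example Q-table: integer states (e.g. grid cells numbered 0..24) and
# four actions (up, down, right, left), initialized to zero.
Q = np.zeros((25, 4))
q_learning_update(Q, x_t=0, a_t=1, r_t=0.0, x_next=5)
```

The only difference between the two functions is the target term, which is exactly the distinction between the off-policy and on-policy rules discussed above.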

III. Q-VALUE TABLE APPROXIMATION METHODS

One of the easiest ways to work efficiently with a state space of large dimension is discretization. With discretization, the state space is divided into regions of small size, and each such region is an input to the table of Q-values. This approach yields an approximation of states. Success in this case directly depends on how well the partition allows the function of Q-values to be represented. On the one hand, for greater accuracy it is necessary to divide the space into smaller areas and, as a result, to use a larger Q-value table, which requires more updates during training. On the other hand, splitting into larger areas may make it impossible to reach the optimal control policy. The described method was successfully applied to the task of balancing a cart-pole [3], which is considered a classic example in the field of reinforcement learning.

There are also methods to speed up the learning process when large Q-value tables are used. One of these methods is the Hamming distance method [4]. With this method, all states are represented in binary form and a similarity threshold is set, namely the number of bits by which one state may differ from another. When Q-values are corrected, the update is performed simultaneously for the selected state and for all states whose Hamming distance from the selected one is less than the specified threshold. Consequently, the spread of Q-values through the table is accelerated.

Another method, called CMAC (Cerebellar Model Articulation Controller), is a compromise between using a simple Q-value table and a continuous approximation of the function [5]. The CMAC approximation structure consists of several layers. Each layer is divided into intervals of the same length (called tiles) using a quantizing function. Since each layer has its own quantization function, the tiles of the layers are shifted relative to each other. Consequently, the state of the system applied to the inputs of the CMAC is matched with a set of overlapping offset tiles. However, despite successful applications, this algorithm requires fairly complex settings. The accuracy of the approximated function is limited by the resolution of quantization: higher quantization accuracy requires more weights and a longer study of the environment.

RBF (Radial Basis Function) networks are closely related to CMAC and to simple tables [6]. With this method of approximation, a grid of Gaussian or quadratic functions is stored instead of a table of Q-values. The state of the system is passed through all the functions, after which the values of the functions are summed, and the result is the approximated value.

There is also a method of approximating the Q-value table by statistical cluster analysis [7]. With this method, each action is associated with a set of clusters that represent evaluations of the action in a particular class of situations. During an update, the Q-values for the current state are updated for all states belonging to its cluster. However, this method has the following limitations: the difficulty of setting parameters for the formation of semantically significant clusters, and the fact that a cluster formed once cannot be broken up later.

It is known that a multilayer perceptron is also a good approximator of functions. The Kolmogorov mapping neural network existence theorem proves that feedforward neural networks with three layers (input layer, hidden layer, output layer) can accurately represent any continuous function [8]. The use of neural networks to approximate the Q-function has the following advantages: effective scaling to input spaces of large dimension, generalization over large and continuous state spaces, and the possibility of implementation on parallel hardware.

IV. BUILDING A PATH BY A MOBILE AGENT

The effectiveness of the algorithms described in this paper was analysed using a developed software simulator of a mobile agent operating in a two-dimensional virtual environment. The agent was tasked to reach the goal while avoiding collisions with obstacles. The virtual environment is divided into cells, and obstacles occupy some of these cells. If the mobile agent enters one of them, it counts as a collision. An example of the environment in which the experiments were conducted is presented in Figure 1.

Fig. 1. Virtual environment with obstacles and found path

The initial position of the mobile agent is in the upper left corner. Figure 1 shows the path found after training. At each step of operation, the agent could choose one of four possible actions: moving forward, moving backwards, moving left, or moving right. The exceptions are the boundary and corner positions, where one or more options are missing. In a single action, the mobile agent can take only one step in the chosen direction.
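As a rough illustration of such a cell-based environment, the sketch below implements a small grid world with obstacle cells and the four single-step actions. The grid size, obstacle positions, reward values and termination rules are assumptions made for the example, since the paper does not specify them.

```python
# A rough grid-world sketch of the simulator described above. The grid size,
# obstacle cells, goal cell, reward values and termination rules are assumed
# for illustration only; the paper does not list them.
GRID = 5                                                  # assumed 5x5 grid of cells
OBSTACLES = {(1, 2), (2, 3), (3, 1)}                      # assumed obstacle cells
GOAL = (4, 4)                                             # assumed goal cell
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}  # up, down, right, left

def step(state, action):
    """Apply one single-cell move; moves that would leave the grid keep the
    agent in place (a simplification of the missing boundary actions)."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    next_state = (min(max(row + d_row, 0), GRID - 1),
                  min(max(col + d_col, 0), GRID - 1))
    if next_state in OBSTACLES:        # entering an obstacle counts as a collision
        return next_state, -1.0, True
    if next_state == GOAL:             # destination point reached
        return next_state, 1.0, True
    return next_state, 0.0, False      # ordinary transition
```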

The Q-value table is indexed by the possible actions of the mobile agent and is filled with rewards according to the chosen behaviour. In Reinforcement Learning algorithms, the processes of exploration and exploitation play an important role. At the first stage, it is necessary to investigate the environment as thoroughly as possible by choosing lower-priority actions. At the final stages, it is necessary to proceed directly to exploitation, choosing higher-priority actions. A smooth transition between exploration and exploitation can be described using the Boltzmann distribution, which has the following form:

    P(a_t | x_t) = exp(Q(x_t, a_t) / T) / Σ_a exp(Q(x_t, a) / T)                 (3)

where T is the temperature, which controls the degree of randomness in the choice of the action with the highest Q-value.

V. EXPERIMENTAL RESULTS

Experiments were carried out in the virtual environment with obstacles; the mobile agent was the object that moved in this environment, studying it in order to find the path to the target point. The program code was written in Python 3, using libraries to visualize the movement of the mobile agent as an object in the environment. At the initial stage of training, the visualization of agent transitions from one point to another was deliberately slowed down so that its behaviour could be observed and studied and the parameters adjusted.

Experiments were conducted for the Q-Learning algorithm and its Sarsa modification with different parameters. A total of 50 experiments were performed. Special attention had to be paid to the range of variation of the temperature T and to its rate of change, since the convergence of the algorithm strongly depends on these parameters. It was established experimentally that the optimal compromise between quality and learning speed is the use of an interval for T from 0.01 to 0.04 over 1000 stages. The final table of Q-values for the Q-Learning algorithm is shown in Table I.

The Q-values of the filled table show which final actions were chosen by the agent after studying the environment (the chosen action in each row corresponds to the highest Q-value). The sequence of final actions to achieve the goal after the Q-table is filled with knowledge is as follows: down-right-down-down-down-right-down-right-down-right-down-down-right-right-up-up. In the experiment with the Q-Learning algorithm, the shortest path to the goal consists of 16 steps, and the longest path to the goal is 185 steps. Figure 2 and Figure 3 show the learning process of the Q-Learning algorithm.

The charts show the number of episodes versus the number of steps (Figure 2) and the number of episodes versus the cost of each episode (Figure 3). From the charts it can be seen that, starting from about the 300th episode, the mobile agent found the path to the target, and its reward for the actions taken grows.

TABLE I. Q-VALUES FOR THE Q-LEARNING ALGORITHM WITH FOUND BEST ACTIONS

    Up             Down           Right           Left
    1.167617e-05   7.172375e-04   -3.701764e-01   9.011239e-06
    0.000028       -0.222179      1.712710e-03    0.000015
    0.000054       0.003944       7.389733e-07    0.000067
    0.000099       0.008411       2.200491e-06    0.000068
    0.000105       0.017174       -2.822695e-01   0.000549
    0.000071       0.000034       1.853133e-02    0.000451
    -0.206386      0.037152       -2.221786e-02   0.000222
    0.001505       0.000234       6.961416e-02    0.000018
    -0.267697      0.125094       7.383372e-04    0.001612
    0.000422       -0.252828      2.666055e-01    0.011204
    0.000012       0.379808       -1.485422e-01   0.013184
    0.009001       0.510842       -2.063857e-01   -0.182093
    0.027898       0.040092       6.458811e-01    0.001964
    -0.104662      0.060179       7.727067e-01    0.015624
    0.888066       0.061701       8.728341e-03    0.093561
    0.997643       0.093159       -1.570568e-01   -0.206386

Fig. 2. Episode via steps for Q-Learning algorithm

Fig. 3. Episode via cost for Q-Learning algorithm

The Q-Learning algorithm showed the best result with the parameters α = 0.5 (learning rate) and γ = 0.99 (discount factor). Table II displays the final comparison of the Q-Learning and Sarsa algorithms.
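The Boltzmann selection rule of equation (3) can be sketched in Python as follows. Subtracting the maximum Q-value before exponentiation is a standard numerical-stability trick and is not part of the paper's formulation.

```python
import numpy as np

def boltzmann_action(Q, state, T):
    """Select an action according to equation (3): actions with higher
    Q-values are chosen with higher probability, and the temperature T
    controls how random the choice is."""
    q = Q[state]
    preferences = np.exp((q - np.max(q)) / T)   # max-subtraction for stability
    probabilities = preferences / preferences.sum()
    return np.random.choice(len(q), p=probabilities)

# Example: a = boltzmann_action(Q, state=0, T=0.02)
```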

Experiments comparing Q-Learning and Sarsa were also carried out in another virtual environment in order to visualize the difference in the behaviour of the mobile agent. The initial position of the mobile agent was again in the upper left corner, and the occupied cells represented a cliff. The goal of the agent is to reach the top right corner without falling into the cliff. The paths found by the mobile agent for the Q-Learning and Sarsa algorithms are shown in Figure 4.

Fig. 4. Comparison analysis of Q-Learning and Sarsa algorithms

Figure 4 shows that when the Q-Learning algorithm is used (upper part of the figure), the mobile agent tries to minimize the number of necessary actions and reach the goal as quickly as possible; however, in this case the risk of falling off the cliff increases. When the Sarsa algorithm is used (bottom part of the figure), the agent gives priority to safety and finds the optimal safe distance from the cliff to minimize the risk of falling; however, in this case the number of actions necessary to reach the goal increases.

TABLE II. COMPARISON OF Q-LEARNING AND SARSA ALGORITHMS

    Algorithm     Learning rate   Discount factor   Minimum steps   Maximum steps
    Q-Learning    0.5             0.99              16              185
    Sarsa         0.5             0.99              19              235

Fig. 5. Q-Learning algorithm learns to avoid falling from the cliff

Fig. 6. Sarsa algorithm learns to avoid falling from the cliff

The charts in Figure 5 and Figure 6 also show the difference in the performance of the Q-Learning and Sarsa algorithms as they learn to avoid falling from the cliff. It can be seen that the Sarsa algorithm incurs more cost during learning than the Q-Learning algorithm, which means that Sarsa takes more steps to reach the goal than Q-Learning.
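This difference in cliff behaviour follows from the on-policy versus off-policy targets. The sketch below runs one training episode with either update, reusing the step(), boltzmann_action(), q_learning_update() and sarsa_update() helpers sketched earlier; the fixed temperature and the mapping of grid cells to Q-table rows are illustrative assumptions.

```python
# One training episode for either algorithm, reusing step(), boltzmann_action(),
# q_learning_update() and sarsa_update() from the sketches above. The fixed
# temperature and the cell-to-row mapping are illustrative assumptions.
def run_episode(Q, use_sarsa, T=0.02, alpha=0.5, gamma=0.99, start=(0, 0)):
    idx = lambda cell: cell[0] * GRID + cell[1]   # grid cell -> Q-table row
    state = start
    action = boltzmann_action(Q, idx(state), T)
    done = False
    while not done:
        next_state, reward, done = step(state, action)
        next_action = boltzmann_action(Q, idx(next_state), T)
        if use_sarsa:
            # On-policy target: the exploratory next action enters the update,
            # so cells from which exploration can fall off the cliff keep
            # lower values and the safer, longer path is preferred.
            sarsa_update(Q, idx(state), action, reward,
                         idx(next_state), next_action, alpha, gamma)
        else:
            # Off-policy target: the greedy maximum enters the update,
            # so the shortest path along the cliff edge is preferred.
            q_learning_update(Q, idx(state), action, reward,
                              idx(next_state), alpha, gamma)
        state, action = next_state, next_action

# Example: Q = np.zeros((GRID * GRID, 4)); run_episode(Q, use_sarsa=True)
```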

VI. CONCLUSIONS

This paper studies the Q-Learning algorithm and its Sarsa modification for the task of finding a path in a given environment. Experiments with these algorithms were conducted with a software simulator of an agent operating in a virtual environment with obstacles. During the experiments, the optimal parameters of the algorithms were found and their efficiency was compared. The algorithms showed the fastest convergence with a learning rate of α = 0.5 and a discount factor of γ = 0.99. Q-Learning showed faster convergence than its Sarsa modification. However, with the Sarsa algorithm the agent moved along a safer trajectory, as shown by additional experiments in another virtual environment simulating a cliff. Consequently, each of the investigated algorithms has its own advantage, in speed (Q-Learning) or in safety (Sarsa), which makes them suitable for different types of tasks.

REFERENCES

[1] K. Arulkumaran, M. P. Deisenroth, M. Brundage, A. A. Bharath, "Deep Reinforcement Learning: A Brief Survey," IEEE Signal Processing Magazine, vol. 34, pp. 26-38, 2017.
[2] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, pp. 9-44, 1988.
[3] S. Nagendra, N. Podila, R. Ugarakhod, K. George, "Comparison of reinforcement learning algorithms applied to the cart-pole problem," 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 26-32, 2017.
[4] H. Jegou, M. Douze, C. Schmid, "Product quantization for nearest neighbor search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 117-128, 2011.
[5] L. Kurtaj, V. Shatri, I. Limani, "On-line learning of robot inverse dynamics with cerebellar model controller in feedforward configuration," International Journal of Mechanical Engineering and Technology, vol. 9, pp. 445-460, 2018.
[6] P. Roy, S. Adhikari, "Radial Basis Function based Self-Organizing Map Model," IOSR Journal of Engineering, vol. 8, pp. 46-52, 2018.
[7] B. Everitt, S. Landau, M. Leese, D. Stahl, "Cluster Analysis," Wiley, 5th edn., 2011.
[8] B. Igelnik, N. Parikh, "Kolmogorov's spline network," IEEE Transactions on Neural Networks, vol. 14, no. 4, pp. 725-733, 2003.
[9] H. V. Hasselt, A. Guez, D. Silver, "Deep reinforcement learning with double Q-learning," AAAI Conference on Artificial Intelligence, pp. 2094-2100, 2016.
[10] E. Even-Dar, Y. Mansour, "Learning rates for Q-learning," Journal of Machine Learning Research, vol. 5, pp. 1-25, 2003.
[11] H. Iima, Y. Kuroe, "Swarm reinforcement learning algorithms based on Sarsa method," SICE Annual Conference, Tokyo, pp. 2045-2049, 2008.
[12] A. Edwards, W. M. Pottenger, "Higher order Q-Learning," 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 128-134, 2011.
[13] D. Xu, Y. Fang, Z. Zhang, Y. Meng, "Path Planning Method Combining Depth Learning and Sarsa Algorithm," 2017 10th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, pp. 77-82, 2017.
[14] F. Tavakoli, V. Derhami, A. Kamalinejad, "Control of humanoid robot walking by Fuzzy Sarsa Learning," 2015 3rd RSI International Conference on Robotics and Mechatronics (ICROM), Tehran, pp. 234-239, 2015.
[15] A. Habib, M. I. Khan, J. Uddin, "Optimal route selection in complex multi-stage supply chain networks using SARSA(λ)," 2016 19th International Conference on Computer and Information Technology (ICCIT), Dhaka, pp. 170-175, 2016.
[16] R. Lowe, T. Ziemke, "Exploring the relationship of reward and punishment in reinforcement learning," 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), Singapore, pp. 140-147, 2013.
[17] R. Zhang, P. Tang, Y. Su, X. Li, G. Yang, C. Shi, "An adaptive obstacle avoidance algorithm for unmanned surface vehicle in complicated marine environments," IEEE/CAA Journal of Automatica Sinica, vol. 1, pp. 385-396, 2014.
[18] R. Ozakar, B. Ozyer, "Ball-cradling using reinforcement algorithms," 2016 National Conference on Electrical, Electronics and Biomedical Engineering (ELECO), pp. 135-141, 2016.
[19] W. Sause, "Coordinated Reinforcement Learning Agents in a Multi-agent Virtual Environment," 2013 12th International Conference on Machine Learning and Applications, pp. 227-230, 2013.
[20] D. A. Vidhate, P. Kulkarni, "Enhanced Cooperative Multi-agent Learning Algorithms (ECMLA) using Reinforcement Learning," 2016 International Conference on Computing, Analytics and Security Trends (CAST), pp. 556-561, 2016.
[21] B. N. Araabi, S. Mastoureshgh, M. N. Ahmadabadi, "A Study on Expertise of Agents and Its Effects on Cooperative Q-Learning," IEEE Transactions on Evolutionary Computation, vol. 14, pp. 23-57, 2010.
[22] A. Deepak, P. Kulkarni, "New Approach for Advanced Cooperative Learning Algorithms using RL methods (ACLA)," VisionNet'16 Proceedings of the Third International Symposium on Computer Vision and the Internet, 2016.
[23] M. Fairbank, E. Alonso, "The divergence of reinforcement learning algorithms with value-iteration and function approximation," 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1-8, 2012.
[24] Y. Yuequan, J. Lu, C. Zhiqiang, T. Hongru, X. Yang, N. Chunbo, "A survey of reinforcement learning research and its application for multi-robot systems," Proceedings of the 31st Chinese Control Conference, pp. 3068-3074, 2012.
[25] W. Xu, J. Huang, Y. Wang, C. Tao, X. Gao, "Research of reinforcement learning based share control of walking-aid robot," Proceedings of the 32nd Chinese Control Conference, pp. 5883-5888, 2013.
[26] M. van der Ree, M. Wiering, "Reinforcement learning in the game of Othello: Learning against a fixed opponent and learning from self-play," 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 108-115, 2013.
[27] Zhi-Xiong Xu, Xi-Liang Chen, Lei Cao, Chen-Xi Li, "A study of count-based exploration and bonus for reinforcement learning," 2017 IEEE 2nd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), pp. 425-429, 2017.
[28] S. Wender, I. Watson, "Applying reinforcement learning to small scale combat in the real-time strategy game StarCraft:Broodwar," 2012 IEEE Conference on Computational Intelligence and Games (CIG), pp. 402-408, 2012.
[29] N. Chauhan, N. Choudhary, K. George, "A comparison of reinforcement learning based approaches to appliance scheduling," 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I), pp. 253-258, 2016.
[30] F. Cardenoso Fernandez, W. Caarls, "Parameters Tuning and Optimization for Reinforcement Learning Algorithms Using Evolutionary Computing," 2018 International Conference on Information Systems and Computer Science (INCISCOS), pp. 301-305, 2018.
[31] W. Lu, J. Yang, H. Chu, "Playing Mastermind Game by Using Reinforcement Learning," 2017 First IEEE International Conference on Robotic Computing (IRC), pp. 418-421, 2017.
[32] M. D. Kaba, M. G. Uzunbas, S. N. Lim, "A Reinforcement Learning Approach to the View Planning Problem," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5094-5102, 2017.
[33] H. Cetin, A. Durdu, "Path planning of mobile robots with Q-learning," 22nd Signal Processing Conference, pp. 2162-2165, 2014.

