Application of Neural Q-Learning Controllers On The Khepera II Via Webots Software
Velappa Ganapathy is with the School of Engineering, Monash University Sunway Campus, Jalan Lagoon Selatan, 46150 Bandar Sunway (phone: +60-3-55146250; fax: +60-3-55146207; e-mail: [email protected]).
Wen Lik Dennis Lui is with the School of Engineering, Monash University Clayton Campus, 3168 VIC, Australia (e-mail: [email protected]).
Department of Mechanical Engineering, Mepco Schlenk Engineering College, Sivakasi, India.

However, its original standard tabular format used to hold Q-values would not yield an efficient system. For instance, a robot with eight sensors, each with an input range of 0-1022, and five actions to choose from will require a (1.1995 × 10^24) × 5 Q-table. To efficiently update and read the Q-values from such a large table would impose a serious problem. The only way to learn anything at all on these tasks is to generalize from previously experienced states to ones that have never been seen. As such, the standard tabular format has been replaced by function approximators such as neural networks.
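For reference, the quoted table size follows directly from treating the 1023 possible readings of each of the eight sensors as separate state values:

    $|S| \times |A| = 1023^{8} \times 5 \approx (1.1995 \times 10^{24}) \times 5$ entries.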
II. RELATED WORKS

Apart from Q-learning, neural networks have also been applied to actor-critic architectures. The types of neural network utilized for reinforcement learning algorithms are the backpropagation feed-forward neural network, the recurrent neural network and the self organizing map. Jakša, R. et al [3], Yang, G.-S. et al [4] and Huang, B.-Q. et al [5] had similarly approached the problem of mobile robot navigation through the combination of the multilayer feed-forward neural network with reinforcement learning algorithms. Jakša, R. et al [3] had utilized an actor-critic architecture whereas the latter two had utilized the Q-Learning algorithm. All results obtained are verified via simulation only. In addition, the developed reinforcement learning controllers are entirely based on the mobile robots' sensors.
The main difference between the recurrent neural networks and the backpropagation feed-forward neural networks is their internal structure. The input layer of the recurrent neural network is divided into two parts: the true input units and the context units. The context units simply hold a copy of the activations of the hidden units from the previous time step. In 1998, Onat, A. et al [6] had thoroughly discussed the architecture, learning algorithms and internal representation of recurrent neural networks for reinforcement learning, and had performed comparisons across the different types of network architectures and learning algorithms through a simple problem. At the same time, Cervera, E. and del Pobil, A.P. [7] had not only applied the recurrent neural network to a sensor-based goal finding task, but the duo extended it by proposing a new method for state identification that eliminates sensor ambiguities.
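As an illustrative sketch only (not the specific networks used in [6] or [7]), an Elman-style recurrent layer whose context units copy the previous hidden activations could be structured as follows; the layer sizes and the tanh activation are assumptions.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Elman-style recurrent layer: the input layer is split into true input
    // units and context units; the context units hold a copy of the hidden
    // activations from the previous time step.
    struct RecurrentLayer {
        std::size_t nIn, nHidden;
        std::vector<double> context;                   // previous hidden activations
        std::vector<std::vector<double>> wIn;          // true inputs   -> hidden
        std::vector<std::vector<double>> wCtx;         // context units -> hidden

        RecurrentLayer(std::size_t in, std::size_t hidden)
            : nIn(in), nHidden(hidden), context(hidden, 0.0),
              wIn(hidden, std::vector<double>(in, 0.0)),
              wCtx(hidden, std::vector<double>(hidden, 0.0)) {}

        std::vector<double> forward(const std::vector<double>& x) {
            std::vector<double> h(nHidden, 0.0);
            for (std::size_t j = 0; j < nHidden; ++j) {
                double sum = 0.0;
                for (std::size_t i = 0; i < nIn; ++i)     sum += wIn[j][i]  * x[i];
                for (std::size_t k = 0; k < nHidden; ++k) sum += wCtx[j][k] * context[k];
                h[j] = std::tanh(sum);                    // assumed activation
            }
            context = h;       // copy current activations for the next time step
            return h;
        }
    };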
The implementation of self organizing maps for Q-learning was further illustrated by Sehad, S. and Touzet, C. [8]. This network learns without the requirement of supervision and is able to detect irregularities and correlations in the input and adapt to them accordingly. The pair had used the self organizing map together with Q-learning to develop an obstacle avoidance behavior for the robot, and its inputs are made up of the 8 proximity sensors.
The other category of reinforcement learning controllers is the vision-based controllers. The previously discussed works are all sensor-based controllers. The most basic vision-based controllers are those which directly input the captured image to the neural network. This is illustrated in the work of Iida, M. et al [9] and Shibata, K. and Iida, M. [10]. The former utilized a linear grayscale camera (1x64 pixels) and the latter utilized a CCD camera (320x240 pixels). The actor-critic architecture was utilized to enable the mobile robot to orientate itself towards an object and push it towards the wall.
To further improve the behavior of vision-based controllers, Gaskett, C. et al [11] had utilized a continuous state, continuous action reinforcement learning algorithm based on a multilayered feed-forward neural network combined with an interpolator. This interpolation scheme is known as 'wire-fitting'. The 'wire-fitting' function is a moving least squares interpolator which is used to increase the speed of the Q-value updating process. It allows the updating process to be conducted whenever it is convenient. The simulation results show that the robot is capable of demonstrating wandering and servoing behaviors through trial and error using reinforcement learning.

III. SYSTEM OVERVIEW

Webots [12] is a commercial mobile robot simulation software package used by over 250 universities and research centers worldwide to model, program and simulate mobile robots. The main reason for its increasing popularity is its ability to reduce the overall development time. Using this platform, a flexible simulation for the Khepera II was developed. Some of the notable features in the simulation are the custom maze design feature, repositioning and reorientation of the sensors, changing wall and floor textures, light intensities, etc.
In addition, Webots was interfaced with the Microsoft Visual C++ .NET 2002 integrated development environment by using a combination of MC++, C++ and C programs. Furthermore, the C programs were interfaced to Matlab 7.1 through the Matlab engine. Matlab was further used to communicate with the serial port such that control commands could be sent to the real robot and vice-versa via the radio base and radio turret. The interaction of the various modules of the system is shown in Fig. 2.
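A minimal sketch of the C-to-Matlab link through the Matlab Engine API (engOpen, engPutVariable, engEvalString, engGetVariable) is given below; the variable names and the pick_action script are hypothetical and merely stand in for the actual neural network code run in the Matlab session.

    #include <cstdio>
    #include <cstring>
    #include "engine.h"        // Matlab Engine C API (link against libeng and libmx)

    int main() {
        Engine* ep = engOpen("");                    // start or attach to a Matlab session
        if (!ep) { std::fprintf(stderr, "Cannot start Matlab engine\n"); return 1; }

        // Push the eight proximity readings into the Matlab workspace
        // (variable name is illustrative).
        double sensors[8] = {0};
        mxArray* mxSensors = mxCreateDoubleMatrix(1, 8, mxREAL);
        std::memcpy(mxGetPr(mxSensors), sensors, sizeof(sensors));
        engPutVariable(ep, "sensors", mxSensors);

        // Let Matlab evaluate the controller (pick_action is a hypothetical script).
        engEvalString(ep, "action = pick_action(sensors);");

        mxArray* mxAction = engGetVariable(ep, "action");
        if (mxAction) {
            std::printf("selected action: %g\n", mxGetScalar(mxAction));
            mxDestroyArray(mxAction);
        }
        mxDestroyArray(mxSensors);
        engClose(ep);
        return 0;
    }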
[Fig. 2: Interaction of the system modules: MC++ & C++ Supervisor; MC++, C++ & C Controller; Matlab Session with the Neural Network and Image Processing Toolboxes; Autodesk 3ds Max]

(v) Determine an action a according to the Boltzmann Probability Distribution (during learning) or greedily as a = argmax_a Q(s, a) (after learning).
(vi) The robot takes action a and reaches a new position. Get the current state.
(vii) If a collision occurred, a negative numerical reward signal will be granted and the robot is reset back to its initial position.
(viii) Then, generate Qtarget according to the equation:
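A standard form of this target (assumed here) is

    $Q_{target} = r + \gamma \max_{a'} Q(s', a')$

where r is the reward received, γ is the discount factor and s' is the newly reached state.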
During learning, the Boltzmann temperature T is gradually reduced towards a lower bound Tmin at a rate governed by β, where Tmin and β (0 < β ≤ 1) are constants. Thus, as T approaches Tmin, the robot will change from exploration to exploitation of the learnt policy. The robot will have a total of 5 actions to select at any state. These actions are illustrated in Fig. 3.
A collision is defined to take place when any one of the five front sensors reads a value exceeding 600. If so, the reward for that iteration is -10.00.
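A compact sketch of how steps (v) to (viii) combine with the temperature schedule and the collision rule above follows; the geometric decay T ← max(Tmin, βT), the discount factor value and the non-collision reward are assumptions used only for illustration.

    #include <algorithm>
    #include <array>
    #include <cmath>
    #include <cstddef>
    #include <random>

    constexpr std::size_t kNumActions = 5;      // five actions per state (Fig. 3)
    constexpr double kCollisionLevel  = 600.0;  // front-sensor value treated as a collision
    constexpr double kCollisionReward = -10.0;  // reward for a colliding iteration
    constexpr double kGamma           = 0.9;    // discount factor (assumed value)

    // Step (v), during learning: Boltzmann (softmax) action selection at temperature T.
    std::size_t boltzmannAction(const std::array<double, kNumActions>& q,
                                double T, std::mt19937& rng) {
        std::array<double, kNumActions> w{};
        for (std::size_t a = 0; a < kNumActions; ++a) w[a] = std::exp(q[a] / T);
        std::discrete_distribution<std::size_t> pick(w.begin(), w.end());
        return pick(rng);      // after learning: take the argmax over q instead
    }

    // Step (vii): collision reward from the five front proximity sensors.
    double reward(const std::array<double, 5>& front) {
        for (double s : front)
            if (s > kCollisionLevel) return kCollisionReward;
        return 0.0;            // non-collision reward not specified here
    }

    // Step (viii): one-step Q-target, assuming the standard Q-learning form.
    double qTarget(double r, const std::array<double, kNumActions>& qNext) {
        return r + kGamma * *std::max_element(qNext.begin(), qNext.end());
    }

    // Temperature annealing towards Tmin at rate beta (0 < beta <= 1), assumed geometric.
    double anneal(double T, double Tmin, double beta) {
        return std::max(Tmin, beta * T);
    }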
B. Sensor-based Wall Following Controller

This controller is basically an extension of the sensor-based obstacle avoidance RL controller. The network configurations and parameters used are the same for both controllers. The only difference between this and the previous controller is its reward function. By altering the reward function, the robot learns to refine its behavior from an obstacle avoidance behavior to a much more refined wall following behavior. The reward function is designed as,
The size of the input image suggests that if the image were applied directly to the neural network, the network would require 64 neurons on its input layer. However, via several experiments, it was found that the neural network was not able to generalize.
Thus, the 64 grayscale values are divided into 8 segments, each containing 8 pixels. For each segment, the average grayscale value is calculated and fed to the neural network on each iteration. This conversion results in 8 average grayscale values for the 8 segments. Similarly, the neural network architecture shown in Fig. 4 could be applied to this controller. The reward function was designed similar to that of the sensor-based obstacle avoidance controller.
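A sketch of this 64-to-8 reduction (pixel values taken as doubles here for simplicity):

    #include <array>
    #include <cstddef>

    // Average the 64 linear-camera grayscale values into 8 segment features
    // (8 consecutive pixels per segment), which become the network inputs.
    std::array<double, 8> segmentAverages(const std::array<double, 64>& pixels) {
        std::array<double, 8> avg{};
        for (std::size_t seg = 0; seg < 8; ++seg) {
            double sum = 0.0;
            for (std::size_t i = 0; i < 8; ++i)
                sum += pixels[seg * 8 + i];
            avg[seg] = sum / 8.0;
        }
        return avg;
    }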
• If (ds0 is medium) and (a0 is medium) then (output1 is medium) (1)
• If (ds0 is medium) and (a0 is low) then (output1 is medium) (1)
• If (ds0 is low) and (a0 is high) then (output1 is high) (1)
• If (ds0 is low) and (a0 is medium) then (output1 is medium) (1)
• If (ds0 is low) and (a0 is low) then (output1 is low) (1)
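For illustration only, the rule fragment above (two antecedents ds0 and a0, one consequent output1, all with rule weight 1) could be held in a simple table; the linguistic-term encoding below is an assumption, not the authors' implementation.

    // Linguistic terms used by the rule fragment above.
    enum class Term { Low, Medium, High };

    // "If (ds0 is A) and (a0 is B) then (output1 is C) (weight)"
    struct FuzzyRule {
        Term ds0, a0, output1;
        double weight;
    };

    const FuzzyRule kRules[] = {
        {Term::Medium, Term::Medium, Term::Medium, 1.0},
        {Term::Medium, Term::Low,    Term::Medium, 1.0},
        {Term::Low,    Term::High,   Term::High,   1.0},
        {Term::Low,    Term::Medium, Term::Medium, 1.0},
        {Term::Low,    Term::Low,    Term::Low,    1.0},
    };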
Fig. 9 Sensor-based Wall Following Controller: (a) Simulation Results and (b) Real Robot Results
Fig. 11 Obstacle Avoidance Controller based on Combination of Sensor and Vision Inputs: (a) Simulation Results, (b) Real Robot Results 1, (c) Real Robot Results 2
on Robotics and Automation, vol. 3, pp. 2174-2179, 1998.
[8] Sehad, S. and Touzet, C., "Self-Organizing Map for Reinforcement Learning: Obstacle-Avoidance with Khepera", Proceedings From Perception to Action Conference, pp. 420-423, 1994.
[9] Iida, M., Sugisaka, M. and Shibata, K., "Application of Direct-Vision-Based Reinforcement Learning to a Real Mobile Robot", Available: http://shws.cc.oita-u.ac.jp/~shibata/pub/ICONIP02-Iida.pdf (Accessed: 2006, June 24).
[10] Shibata, K. and Iida, M., "Acquisition of Box Pushing by Direct-Vision-Based Reinforcement Learning", Available: http://shws.cc.oita-u.ac.jp/~shibata/pub/SICE03.pdf (Accessed: 2006, June 24).
[11] Gaskett, C., Fletcher, L. and Zelinsky, A., "Reinforcement Learning for a Vision Based Mobile Robot", Available: http://users.rsise.anu.edu.au/~rsl/rsl_papers/2000iros-nomad.pdf (Accessed: 2006, May 31).
[12] Webots. http://www.cyberbotics.com. Commercial Mobile Robot Simulation Software.
[13] "Semi-Online Neural-Q Learning", Available: http://www.tdx.cesca.es/TESIS_UdG/AVAILABLE/TDX-0114104-123825//tmcp3de3.pdf (Accessed: 2006, June 30).