
International Conference on Fascinating Advancement in Mechanical Engineering (FAME2008), 11-13 December 2008

Application of Neural Q-Learning Controllers on the Khepera II via Webots Software
Velappa Ganapathy and Wen Lik Dennis Lui

Velappa Ganapathy is with the School of Engineering, Monash University Sunway Campus, Jalan Lagoon Selatan, 46150 Bandar Sunway (phone: +60-3-55146250; fax: +60-3-55146207; e-mail: [email protected]).
Wen Lik Dennis Lui is with the School of Engineering, Monash University Clayton Campus, 3168 VIC, Australia (e-mail: [email protected]).

Abstract— In recent years, there has been an increasing amount of research performed in the area of mobile robotics. As such, numerous strategies have been proposed to incorporate fundamental navigation behaviors such as obstacle avoidance, wall following and path planning into mobile robots. These controllers were developed using different methods and techniques which range from traditional logic controllers to neural controllers. Logic controllers require the programmer to specify the required action for every state of the mobile robot, while neural networks, such as the popular backpropagation feed-forward neural network, require the presence of a teacher during learning. To achieve a fully autonomous mobile robot, the robot should be given the ability to learn from its own experience. Supervised learning is an important kind of learning, but on its own it is not sufficient for learning from interaction. To enable the mobile robot to learn through interaction with the environment, reinforcement learning algorithms are investigated. In this paper, the Neural Q-Learning algorithm was implemented on the Khepera II via the Webots software. The designed controllers include both sensor and vision based controllers. These controllers are capable of exhibiting obstacle avoidance and wall following behaviors. In addition, an obstacle avoidance controller based on a combination of sensor and visual inputs via fuzzy logic was proposed.

Keywords—Reinforcement Learning, Neural Q-Learning, Fuzzy Logic, Obstacle Avoidance, Wall Following, Khepera II, Webots.

I. INTRODUCTION

As defined by Sutton, R. S. and Barto, A. G. [1], reinforcement learning is learning what to do so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. Fig. 1 shows the agent-environment interaction in reinforcement learning [1]. The Q-Learning algorithm proposed by Watkins, C. and Dayan, P. [2] is a widely used implementation of reinforcement learning based on dynamic programming techniques and temporal difference methods [1]. The algorithm estimates the expected discounted reward Q(s,a) obtained by taking action a at state s. This brings the robot to the next state, s′. The algorithm further estimates the value of the next state, Q(s′,a′), assuming action a′ is taken in state s′. Then, using the results of each action, it updates the Q-values of the Q-table according to the following equation,

Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)]  (1)

where α is the learning rate, γ is the discount rate and r is the immediate reinforcement. The suitability of the Q-learning algorithm for problems with a discrete set of states and actions makes it a natural fit for the development of mobile robot navigation behaviors such as obstacle avoidance and wall following.

Fig. 1 The Agent-Environment Interaction in Reinforcement Learning
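For concreteness, a minimal tabular sketch of update (1) might look as follows; the table dimensions, learning rate and discount rate here are placeholders for illustration, not values used in this work.

```c
#define NUM_STATES  64   /* placeholder size; the actual state space is far larger */
#define NUM_ACTIONS 5    /* the five discrete actions used later in Section IV     */

/* One tabular Q-learning update, Eq. (1):
 * Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) */
static void q_update(double Q[NUM_STATES][NUM_ACTIONS],
                     int s, int a, double r, int s_next,
                     double alpha, double gamma)
{
    double best_next = Q[s_next][0];
    for (int a2 = 1; a2 < NUM_ACTIONS; ++a2)
        if (Q[s_next][a2] > best_next)
            best_next = Q[s_next][a2];

    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a]);
}
```

As the next paragraph makes clear, storing Q in an explicit table of this kind quickly becomes infeasible for the Khepera II, which is why the table is replaced by a neural network.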

However, the original standard tabular formulation used to hold the Q-values does not yield an efficient system. For instance, a robot with eight sensors, each with an input range of 0-1022, and five actions to choose from would require a (1.1995 × 10^24) × 5 Q-table. Efficiently updating and reading the Q-values from such a large table would pose a serious problem. The only way to learn anything at all on these tasks is to generalize from previously experienced states to ones that have never been seen. As such, the standard tabular formulation has been replaced by function approximators such as neural networks.
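The figure quoted above can be checked directly: each of the eight sensors takes one of 1023 values (0-1022), giving 1023^8 ≈ 1.1995 × 10^24 states and five actions per state. A throwaway check, for illustration only:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    double states  = pow(1023.0, 8);  /* 8 sensors, 1023 possible readings each */
    double entries = states * 5.0;    /* 5 actions per state                    */
    printf("states = %.4e, Q-table entries = %.4e\n", states, entries);
    /* prints approximately: states = 1.1995e+24, Q-table entries = 5.9976e+24 */
    return 0;
}
```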
II. RELATED WORKS

Apart from being applied to Q-learning, neural networks have also been applied to actor-critic architectures. The types of neural network utilized for reinforcement learning algorithms are the backpropagation feed-forward neural network, the recurrent neural network and the self-organizing map. Jakša, R. et al [3], Yang, G.-S. et al [4] and Huang, B.-Q. et al [5] similarly approached the problem of mobile robot navigation through the combination of the multilayer feed-forward neural network with reinforcement learning algorithms. Jakša, R. et al [3] utilized an actor-critic architecture, whereas the latter two utilized the Q-Learning algorithm. All results obtained were verified via simulation only. In addition, the developed reinforcement learning controllers are based entirely on the mobile robots' sensors.

The main difference between recurrent neural networks and backpropagation feed-forward neural networks is their internal structure. The input layer of the recurrent neural network is divided into two parts: the true input units and the context units. The context units simply hold a copy of the activations of the hidden units from the previous time steps. In 1998, Onat, A. et al [6] thoroughly discussed the architecture, learning algorithms and internal representation of recurrent neural networks for reinforcement learning, and performed comparisons across the different types of network architectures and learning algorithms on a simple problem. At the same time, Cervera, E. and del Pobil, A.P. [7] not only applied the recurrent neural network to a sensor-based goal finding task, but also extended it by proposing a new method for state identification that eliminates sensor ambiguities.

The implementation of self-organizing maps for Q-learning was further illustrated by Sehad, S. and Touzet, C. [8]. This network learns without the requirement of supervision, and it is able to detect irregularities and correlations in the input and adapt accordingly. The pair used the self-organizing map together with Q-learning to develop an obstacle avoidance behavior for the robot, with its inputs made up of the 8 proximity sensors.

The other category of reinforcement learning controllers is the vision-based controllers; the previously discussed works are all sensor-based controllers. The most basic vision-based controllers are those which directly input the captured image to the neural network. This is illustrated in the work of Iida, M. et al [9] and Shibata, K. and Iida, M. [10]. The former utilized a linear grayscale camera (1x64 pixels) and the latter utilized a CCD camera (320x240 pixels). The actor-critic architecture was utilized to enable the mobile robot to orient itself towards an object and push it towards the wall.

To further improve the behavior of vision-based controllers, Gaskett, C. et al [11] utilized a continuous state, continuous action reinforcement learning algorithm based on a multilayered feed-forward neural network combined with an interpolator. This interpolation scheme is known as 'wire-fitting'. The 'wire-fitting' function is a moving least squares interpolator which is used to increase the speed of the Q-value updating process. It allows the updating process to be conducted whenever it is convenient. The simulation results show that the robot is capable of demonstrating wandering and servoing behaviors through trial and error using reinforcement learning.

III. SYSTEM OVERVIEW

Webots [12] is a commercial mobile robot simulation software package used by over 250 universities and research centers worldwide to model, program and simulate mobile robots. The main reason for its increasing popularity is its ability to reduce the overall development time. Using this platform, a flexible simulation for the Khepera II was developed. Some of the notable features in the simulation are the custom maze design feature, repositioning and reorientation of the sensors, changing wall and floor textures, light intensities, etc.

In addition, Webots was interfaced with the Microsoft Visual C++ .NET 2002 integrated development environment by using a combination of MC++, C++ and C programs. Furthermore, the C programs were interfaced to Matlab 7.1 through the Matlab engine. Matlab was further used to communicate with the serial port such that control commands could be sent to the real robot and vice-versa via the radio base and radio turret. The interaction of the various modules of the system is shown in Fig. 2.

Fig. 2 System Overview
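The paper does not reproduce the interfacing code itself; the following is only a rough sketch of how a C controller could hand a state vector to the background Matlab session through the Matlab Engine C API (engPutVariable, engEvalString, engGetVariable) and read the Q-values back, assuming the session was opened earlier with engOpen and that a trained network object named net already exists in the Matlab workspace.

```c
#include <string.h>
#include "engine.h"   /* Matlab Engine C API, shipped with Matlab */

/* Sketch: push 8 sensor readings to Matlab, evaluate the neural network
 * there, and copy the 5 Q-values back. Variable and object names ("state",
 * "q", "net") are assumptions; error handling is kept minimal. */
int evaluate_q_values(Engine *ep, const double sensors[8], double q_out[5])
{
    mxArray *in = mxCreateDoubleMatrix(1, 8, mxREAL);
    memcpy(mxGetPr(in), sensors, 8 * sizeof(double));

    engPutVariable(ep, "state", in);
    engEvalString(ep, "q = sim(net, state');");   /* Neural Network Toolbox */

    mxArray *out = engGetVariable(ep, "q");
    if (out == NULL) { mxDestroyArray(in); return -1; }
    memcpy(q_out, mxGetPr(out), 5 * sizeof(double));

    mxDestroyArray(in);
    mxDestroyArray(out);
    return 0;
}
```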


The Neural Q-Learning algorithms are mostly written in C. The controller can easily be switched from one algorithm to another by using the Graphical User Interface (GUI) developed for the robot controller. By treating Matlab as a background computation engine, the C programs are able to make use of the neural network toolbox in Matlab. Thus, data is transmitted back and forth between the simulation and Matlab and vice-versa.

IV. NEURAL Q-LEARNING

In this work, a total of four controllers were developed. These controllers are:

(i) Sensor-based Obstacle Avoidance Controller
(ii) Sensor-based Wall Following Controller
(iii) Vision-based Obstacle Avoidance Controller
(iv) Obstacle Avoidance Controller based on a Combination of Sensor and Visual Inputs

A common learning algorithm applies to all these controllers. The Neural Q-Learning algorithm implemented is as follows:

(i) Initialize the neural network in Matlab and randomly assign the weights of the neural network.
(ii) Define the initial position of the Khepera II in the simulation.
(iii) Obtain the sensor readings from the infrared sensors, the visual input from the camera, or a combination of both.
(iv) Obtain Q(s,a) for each action by substituting the current state and action into the neural network.
(v) Determine an action a according to the Boltzmann Probability Distribution (during learning) or greedily as a = arg max_a Q(s,a) (after learning).
(vi) The robot takes action a and reaches a new position. Get the current state.
(vii) If a collision occurred, a negative numerical reward signal is granted and the robot is reset back to its initial position.
(viii) Then, generate Qtarget according to the equation:

Qtarget(st, at) = r(st, at, st+1) + γ max_{at+1 ∈ A} Q(st+1, at+1)  (2)

where γ is the discount rate (0 ≤ γ ≤ 1) and r(st, at, st+1) is the reward signal assigned to action at for bringing the robot from state st to state st+1.
(ix) Construct the error vector by using Qtarget for the output unit corresponding to the action taken and 0 for the other output units.
(x) Repeat (iii)-(ix) until the robot is able to demonstrate the expected behavior.

To allow the mobile robot to explore the environment first and slowly converge to exploiting the learnt policy, the Boltzmann Probability Distribution was utilized. The Boltzmann probability can be denoted by the following equations,

prob(ak) = (1/f) exp(Q(st, ak) / Tt)  (3)

where

f = Σ_a exp(Q(st, a) / Tt)  (4)

and t is the current iteration and k is the index of the action selected. The Boltzmann Probability Distribution was originally derived for physics and chemistry applications. Nevertheless, it has been adopted for use in reinforcement learning algorithms to define a policy which becomes greedier over time. The key parameter ensuring this policy is T, which is known as the temperature. It controls the randomness of the action selection; it is set high at the beginning of the learning phase and slowly decreased on each iteration according to the following equation,

Tt+1 = Tmin + β(Tt − Tmin)  (5)

where Tmin and β (0 < β ≤ 1) are constants. Thus, as T approaches Tmin, the robot changes from exploration to exploitation of the learnt policy. The robot has a total of 5 actions to select from at any state. These actions are illustrated in Fig. 3.

Fig. 3 The Five Actions of the Khepera II
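To make steps (v)-(ix) concrete, a compact sketch is given below. It assumes the Q-values of the current and next states have already been obtained from the network (steps (iv) and (viii)); the handling of the error vector follows one common reading of step (ix) and is not taken verbatim from the authors' code.

```c
#include <math.h>
#include <stdlib.h>

#define NUM_ACTIONS 5

/* Boltzmann action selection, Eqs. (3)-(4):
 * prob(a_k) = exp(Q(s_t,a_k)/T_t) / sum_a exp(Q(s_t,a)/T_t) */
static int boltzmann_select(const double q[NUM_ACTIONS], double T)
{
    double p[NUM_ACTIONS], f = 0.0;
    for (int a = 0; a < NUM_ACTIONS; ++a) {
        p[a] = exp(q[a] / T);
        f += p[a];
    }
    double u = ((double)rand() / RAND_MAX) * f;   /* sample the distribution */
    double acc = 0.0;
    for (int a = 0; a < NUM_ACTIONS; ++a) {
        acc += p[a];
        if (u <= acc)
            return a;
    }
    return NUM_ACTIONS - 1;
}

/* Temperature annealing, Eq. (5): T_{t+1} = Tmin + beta * (T_t - Tmin) */
static double anneal_temperature(double T, double T_min, double beta)
{
    return T_min + beta * (T - T_min);
}

/* Steps (viii)-(ix): Q-target of Eq. (2) and the error vector used to train
 * the network; only the output unit of the action taken receives an error. */
static void build_error_vector(const double q[NUM_ACTIONS],
                               const double q_next[NUM_ACTIONS],
                               int action_taken, double reward, double gamma,
                               double error[NUM_ACTIONS])
{
    double best_next = q_next[0];
    for (int a = 1; a < NUM_ACTIONS; ++a)
        if (q_next[a] > best_next)
            best_next = q_next[a];

    double q_target = reward + gamma * best_next;       /* Eq. (2)        */
    for (int a = 0; a < NUM_ACTIONS; ++a)
        error[a] = 0.0;
    error[action_taken] = q_target - q[action_taken];   /* zero elsewhere */
}
```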
A. Sensor-based Obstacle Avoidance Controller

This controller is very similar to the works of Yang, G.-S. et al [4] and Huang, B.-Q. et al [5]. However, they validated their controllers in simulation only. In this work, this has been extended to the validation of the controller on the actual robot. The input states are the 8 sensor readings. The neural network design which successfully demonstrated the desired result is a 3-layer feed-forward backpropagation neural network. Fig. 4 shows the neural network architecture for this controller. It has 3 layers, with 8 neurons on the input layer (pure linear activation function), 16 neurons on the hidden layer (tangent sigmoid activation function) and 5 neurons on the output layer (pure linear activation function). The Variable Learning Rate Backpropagation training algorithm is used to train the neural network, and the network has eight inputs, one per infrared distance sensor, each ranging from 0 to 1022. The reward function is designed as,

Forward Motion: +0.30
Turn Left: +0.15
Turn Right: +0.15
Rotate Left: -0.10
Rotate Right: -0.10

Fig. 4 Neural Network Architecture

A collision is defined to take place when any one of the five front sensors reads a value exceeding 600. If so, the reward for that iteration is -10.00.

B. Sensor-based Wall Following Controller

This controller is basically an extension of the sensor-based obstacle avoidance RL controller. The network configurations and parameters used are the same for both controllers. The only difference between this and the previous controller is its reward function. By altering the reward function, the robot learns to refine its behavior from an obstacle avoidance behavior to a much more refined wall following behavior. The reward function is designed as,

Forward Motion: +0.05
Turn Left: +0.00
Turn Right: +0.00
Rotate Left: -0.10
Rotate Right: -0.10

A collision is defined to take place when any one of the five front sensors reads a value exceeding 600. If so, the reward for that iteration is -10.00. However, if the reading is between 250 and 600, then the final reward obtained is +1.00.
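A small sketch of the two reward functions just described is given below; the action ordering and the precedence of the 250-600 "near wall" band over the per-action rewards are assumptions, since the paper states the rules only in prose.

```c
enum action { FORWARD, TURN_LEFT, TURN_RIGHT, ROTATE_LEFT, ROTATE_RIGHT };

#define COLLISION_THRESHOLD 600   /* any front sensor above this -> collision */
#define WALL_BAND_LOW       250   /* wall-following bonus band: 250-600       */

/* Reward for the sensor-based obstacle avoidance controller (IV-A). */
static double reward_obstacle_avoidance(enum action a, const int front[5])
{
    for (int i = 0; i < 5; ++i)
        if (front[i] > COLLISION_THRESHOLD)
            return -10.0;                        /* collision */
    switch (a) {
    case FORWARD:    return +0.30;
    case TURN_LEFT:
    case TURN_RIGHT: return +0.15;
    default:         return -0.10;               /* rotations */
    }
}

/* Reward for the sensor-based wall following controller (IV-B): same
 * collision rule, plus +1.00 whenever a front reading lies within 250-600. */
static double reward_wall_following(enum action a, const int front[5])
{
    int near_wall = 0;
    for (int i = 0; i < 5; ++i) {
        if (front[i] > COLLISION_THRESHOLD)
            return -10.0;
        if (front[i] >= WALL_BAND_LOW)
            near_wall = 1;
    }
    if (near_wall)
        return +1.00;
    switch (a) {
    case FORWARD:    return +0.05;
    case TURN_LEFT:
    case TURN_RIGHT: return +0.00;
    default:         return -0.10;
    }
}
```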


C. Vision-based Obstacle Avoidance Controller

This controller represents its states in a totally different way compared to the sensor-based controllers. The visual input is acquired from the K213 Linear Grayscale Camera, which provides an image with an array size of 1x64 pixels. To allow the robot to acquire an obstacle avoidance behavior through the linear grayscale input, some modifications were made to the original environment. The surrounding wall textures were changed to black and white stripes. The number and width of these stripes change as the robot moves toward or away from the walls. As such, this serves as a pattern for the robot to determine whether it is close to a wall, based on the visual input. Fig. 5 shows the Khepera II with the K213 Linear Grayscale Camera Module.

Fig. 5 Khepera II with the K213 Linear Grayscale Camera

Looking at the size of the input image, applying the image directly to the neural network would require 64 neurons on its input layer. However, several experiments showed that the neural network was then not able to generalize. Thus, the 64 grayscale values are divided into 8 segments, each containing 8 pixels. For each segment, the average grayscale value is calculated and fed to the neural network on each iteration. This conversion results in 8 average grayscale values for the 8 segments. Similarly, the neural network architecture shown in Fig. 4 could be applied to this controller. The reward function was designed similarly to that of the sensor-based obstacle avoidance controller.

D. Obstacle Avoidance Controller based on a Combination of Sensor and Visual Inputs

This controller was proposed in order to overcome the weakness of the vision-based obstacle avoidance controller. Its weakness will be illustrated in the following section. This controller does not only avoid obstacles; it also stays away from black objects. This behavior is created by implementing a Fuzzy Logic Controller (FLC). The FLC fuzzifies the 8 sensor readings and 8 average grayscale values into 8 outputs. It is designed in such a way that the sensor readings take priority over the visual input. For each input, three membership functions are specified, i.e. low, medium and high. The membership functions for the sensor inputs are illustrated in Fig. 6(a) and the membership functions for the averaged grayscale values of the linear grayscale image are illustrated in Fig. 6(b).

Fig. 6 Membership Function for (a) Sensor Readings and (b) Averaged Grayscale Values

As there are 8 sensors and 8 averaged grayscale values, the rules are relatively easy to define. The pairing of a sensor input with an image segment depends entirely on its position. For example, sensor ds0, which is located on the left side of the Khepera, is combined with the averaged value of the leftmost segment of the grayscale image. For the last two averaged values, there is no other option but to pair them with the two rear sensors. However, this is not a concern, as the two readings from the rear sensors are normally not taken into consideration since there are no backward motions. Fig. 7 shows the membership function of the FLC outputs, and the rules for each pair of inputs are as follows,

• If (ds0 is high) then (output1 is high) (1)
• If (ds0 is medium) and (a0 is medium) then (output1 is medium) (1)
• If (ds0 is medium) and (a0 is low) then (output1 is medium) (1)
• If (ds0 is low) and (a0 is high) then (output1 is high) (1)
• If (ds0 is low) and (a0 is medium) then (output1 is medium) (1)
• If (ds0 is low) and (a0 is low) then (output1 is low) (1)

Fig. 7 FLC Output Membership Function

The neural network architecture implemented for this controller is a 3-layer feed-forward backpropagation neural network. Likewise, it has 8 neurons on the input layer (pure linear activation function), but 32 neurons on the hidden layer (tangent sigmoid activation function) and 5 neurons on the output layer (pure linear activation function). The Variable Learning Rate Backpropagation training algorithm was utilized, and its reward function is designed as,

Forward Motion: +0.50
Turn Left: +0.15
Turn Right: +0.15
Rotate Left: -0.20
Rotate Right: -0.20

A collision is defined to take place when any one of the five front sensors reads a value exceeding 600. If so, the reward for that iteration is -10.00. Then, if the total number of inputs in the current state exceeding 200 is less than the total number of inputs in the next state exceeding 200, the original numerical reward assigned for taking that action gets an additional 0.50.
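A rough sketch of the fuzzification just described is given below, assuming ramp-style membership functions over inputs normalized to [0, 1] and representative output levels of 0.0/0.5/1.0; the paper defines the actual membership functions only graphically (Figs. 6 and 7). AND is taken as min, rule aggregation as max, and defuzzification as a weighted average.

```c
static double clamp01(double x) { return x < 0.0 ? 0.0 : (x > 1.0 ? 1.0 : x); }

/* Assumed low/medium/high memberships over a normalized input in [0,1]. */
static double mf_low(double x)    { return clamp01((0.4 - x) / 0.4); }
static double mf_medium(double x) { return x < 0.5 ? clamp01((x - 0.1) / 0.4)
                                                   : clamp01((0.9 - x) / 0.4); }
static double mf_high(double x)   { return clamp01((x - 0.6) / 0.4); }

static double min2(double a, double b) { return a < b ? a : b; }
static double max2(double a, double b) { return a > b ? a : b; }

/* Average each block of 8 pixels of the 1x64 grayscale image (normalized). */
static void average_segments(const double pixels[64], double segments[8])
{
    for (int s = 0; s < 8; ++s) {
        double sum = 0.0;
        for (int p = 0; p < 8; ++p)
            sum += pixels[8 * s + p];
        segments[s] = sum / 8.0;
    }
}

/* Combine one sensor reading with its positionally paired segment average
 * using the rule base listed above. Returns one fuzzified network input. */
static double flc_combine(double sensor, double gray)
{
    double lo = 0.0, me = 0.0, hi = 0.0;

    hi = max2(hi, mf_high(sensor));                            /* ds high             -> high   */
    me = max2(me, min2(mf_medium(sensor), mf_medium(gray)));   /* ds medium, a medium -> medium */
    me = max2(me, min2(mf_medium(sensor), mf_low(gray)));      /* ds medium, a low    -> medium */
    hi = max2(hi, min2(mf_low(sensor), mf_high(gray)));        /* ds low,    a high   -> high   */
    me = max2(me, min2(mf_low(sensor), mf_medium(gray)));      /* ds low,    a medium -> medium */
    lo = max2(lo, min2(mf_low(sensor), mf_low(gray)));         /* ds low,    a low    -> low    */

    double den = lo + me + hi;
    return den > 0.0 ? (0.5 * me + 1.0 * hi) / den : 0.0;      /* assumed output levels */
}
```

Each of the 8 sensor readings is paired with its positionally corresponding grayscale segment, so eight calls to flc_combine would yield the 8 fuzzified inputs fed to the network.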

V. RESULTS AND DISCUSSION

Before presenting the results of the controllers, certain measures are needed to evaluate the function approximator, which in this case is the neural network. Most supervised learning seeks to minimize the mean-squared error (MSE) over some distribution, P, of the inputs. Another measure is the number of epochs it takes the neural network to acquire a behavior. However, with this algorithm, it is very hard to ensure that a better approximation at some states can be gained without the expense of a worse approximation at other states. This widely known issue is termed the interference problem. As such, the most effective performance measure for the neural network is observation of its actual behavior. Although this does not provide an accurate measure of its performance, it is the best way to ensure the function approximator has successfully acquired the desired behavior. Video clips of the robot were recorded for observation purposes. However, to illustrate the results in this paper, the trajectories taken by the robot in the simulated and actual environments are drawn.

A. Sensor-based Obstacle Avoidance Controller

It can be seen in Fig. 8 that the robot has successfully demonstrated an obstacle avoidance behavior. In the simulation, more obstacles are present. This is due to the convenience of the custom maze building feature in the supervisor controller program developed in Webots. However, for the real robot, the environment is made much simpler due to mobility issues. The learning time for each neural network, which has its weights randomly assigned, differs significantly. Thus, the number of epochs during each learning phase does not indicate anything at all.

Fig. 8 Sensor-based Obstacle Avoidance Controller (a) Simulation Results and (b) Real Robot Results

B. Sensor-based Wall Following Controller

The sensor-based wall following behavior was successfully acquired by the robot and the results are illustrated in Fig. 9. Compared to the previous controller, the movement of the robot is much more refined. This is because the robot travels following the position of the walls. Achieving this behavior required only slight modifications to the reward function.

Fig. 9 Sensor-based Wall Following Controller (a) Simulation Results and (b) Real Robot Results

C. Vision-based Obstacle Avoidance Controller

The results shown in Fig. 10 suggest that the neural network is able to generalize when the average grayscale values are fed into the neural network. Although the robot learns how to avoid colliding with walls through the visual input, its limited field of view has resulted in side collisions. Due to this, the controller was not tested on the real robot, to avoid unnecessary damage.

Fig. 10 Vision-based Obstacle Avoidance Controller (a) Simulation 1 and (b) Simulation 2

D. Obstacle Avoidance Controller based on a Combination of Sensor and Visual Inputs

This controller incorporates two behaviors with the same goal: the obstacle avoidance behavior together with the avoidance of dark objects by implementing an FLC. To test this controller, black objects are placed at strategic locations on the walls, which act as a secondary guide for the robot to reach its final position. The sensor readings keep it safe from wall collisions on the side. This makes this controller superior to the vision-based controller. Fig. 11 shows the trajectory taken by the robot in both the simulated and actual environments.

Fig. 11 Obstacle Avoidance Controller based on a Combination of Sensor and Vision Inputs (a) Simulation Results, (b) Real Robot Results 1, (c) Real Robot Results 2


E. Discussion

The main advantage of reinforcement learning is its ability to allow the robot to learn through interaction. The environment can be totally unknown. The robot learns through the rewards it obtains for each action taken in different states. This concept is similar to how humans learn. Humans learn naturally through experience. For example, we learn not to touch a cactus after we get our fingers pricked; the pain results in a negative reward. As such, through an accumulation of different experiences, humans acquire new skills and behaviors. Of course, for the reinforcement learning controller to reach such levels, further advances in its theory will be required.

The drawback of the learning algorithm is the large number of unknowns. All the parameters, such as the discount rate, initial temperature parameter, reward function and neural network parameters, are unknowns. There are no formulas or guidelines to select an optimum set of parameters for the problem at hand. This is a major drawback when different configurations are being experimented with. Performing analysis on the neural network is already complex enough, but the additional parameters introduced by the learning algorithm make the analysis even tougher. Due to the large number of parameters which could be altered, it is often quite hard to identify the actual reason for the failure of the robot to learn a desired behavior. The only way to identify how these parameters influence the system is through experience. Experimental results reveal that the discount rate should always start from a lower value such that the neural network, which has its weights randomly assigned, can settle down to an even state before the actual learning phase starts. Then, the discount rate is slowly increased as the number of iterations increases. Furthermore, it was found that the Variable Learning Rate Backpropagation training algorithm works well for the value approximation problem.

Another major drawback of this algorithm is the presence of the interference problem. One approach to this problem is to adopt the Semi-Online Neural Q-Learning algorithm [13]. This network acts locally and ensures that learning in one zone does not affect the learning in other zones. It uses a database of learning samples. The main goal of this database is to include a representative set of visited learning samples, which is repeatedly used to update the Neural Q-Learning algorithm. The immediate advantage of this is the stability of the learning process and its convergence even in difficult problems. All the samples in the database are compared to the new ones; if old samples are found to be similar to the new ones, they are replaced. This ensures that the database is maintained at the optimum size and no unnecessary extra training time is taken up due to duplicate samples. However, there are still issues regarding the training required for each training cycle when the database scales up. Furthermore, an efficient updating procedure is required for the database such that minimum time is required for the updating process.

VI. CONCLUSION

In this paper, four reinforcement learning controllers based on the Neural Q-Learning algorithm were designed and tested on the actual and simulated robot using the Webots commercial robot simulation software. In addition to building a flexible simulation environment for the Khepera II, the work by Yang, G.-S. et al [4] and Huang, B.-Q. et al [5] was further extended by validating the sensor-based obstacle avoidance controller on the actual robot. This eventually led to the investigation of the wall following behavior and the vision-based controllers, with the pros and cons of the learning algorithm observed during the experiments being highlighted. In conclusion, the robot was able to acquire its desired behavior through its interaction with the environment.

ACKNOWLEDGMENT

The authors thank Monash University Malaysia for the support of this work.
REFERENCES

[1] Sutton, R.S. and Barto, A.G., "Reinforcement Learning: An Introduction", MIT Press, 1998.
[2] Watkins, C. and Dayan, P., "Q-Learning", Machine Learning, vol. 8, pp. 279-292, 1992.
[3] Jakša, R., Sinčák, P. and Majerník, P., "Backpropagation in Supervised and Reinforcement Learning for Mobile Robot Control", Available: http://neuron-ai.tuke.sk/~jaksa/publications/Jaksa-Sincak-Majernik-ELCAS99.pdf (Accessed: 2006, April 24).
[4] Yang, G.-S., Chen, E.-K. and An, C.-W., "Mobile Robot Navigation Using Neural Q-Learning", Proceedings of 2004 International Conference on Machine Learning and Cybernetics, vol. 1, pp. 48-52, 2004.
[5] Huang, B.-Q., Cao, G.-Y. and Guo, M., "Reinforcement Learning Neural Network to the Problem of Autonomous Mobile Robot Obstacle Avoidance", Proceedings of 2005 International Conference on Machine Learning and Cybernetics, vol. 1, pp. 85-89, 2005.
[6] Onat, A., Kita, H. and Nishikawa, Y., "Recurrent Neural Networks for Reinforcement Learning: Architecture, Learning Algorithms and Internal Representation", 1998 IEEE International Joint Conference on Neural Networks Proceedings, vol. 3, pp. 2010-2015, 1998.
[7] Cervera, E. and del Pobil, A.P., "Eliminating Sensor Ambiguities Via Recurrent Neural Networks in Sensor-Based Learning", 1998 IEEE International Conference on Robotics and Automation, vol. 3, pp. 2174-2179, 1998.
[8] Sehad, S. and Touzet, C., "Self-Organizing Map for Reinforcement Learning: Obstacle-Avoidance with Khepera", Proceedings of the From Perception to Action Conference, pp. 420-423, 1994.
[9] Iida, M., Sugisaka, M. and Shibata, K., "Application of Direct-Vision-Based Reinforcement Learning to a Real Mobile Robot", Available: http://shws.cc.oita-u.ac.jp/~shibata/pub/ICONIP02-Iida.pdf (Accessed: 2006, June 24).
[10] Shibata, K. and Iida, M., "Acquisition of Box Pushing by Direct-Vision-Based Reinforcement Learning", Available: http://shws.cc.oita-u.ac.jp/~shibata/pub/SICE03.pdf (Accessed: 2006, June 24).
[11] Gaskett, C., Fletcher, L. and Zelinsky, A., "Reinforcement Learning for a Vision Based Mobile Robot", Available: http://users.rsise.anu.edu.au/~rsl/rsl_papers/2000iros-nomad.pdf (Accessed: 2006, May 31).
[12] Webots, http://www.cyberbotics.com, Commercial Mobile Robot Simulation Software.
[13] "Semi-Online Neural-Q Learning", Available: http://www.tdx.cesca.es/TESIS_UdG/AVAILABLE/TDX-0114104-123825//tmcp3de3.pdf (Accessed: 2006, June 30).
