A Painless Q-Learning Tutorial
Suppose we have a building with five rooms connected by doors, numbered 0 through 4. The outside of the building can be thought of as one big room, number 5. We can represent the rooms on a graph, each room as a node, and each door as a link.
For this example, we'd like to put an agent in any room, and from that room, go outside the
building (this will be our target room). In other words, the goal room is number 5. To set this
room as a goal, we'll associate a reward value with each door (i.e., each link between nodes). The
doors that lead immediately to the goal have an instant reward of 100. Other doors not directly
connected to the target room have zero reward. Because doors are two-way (0 leads to 4, and
4 leads back to 0), two arrows are assigned to each door, one for each direction. Each arrow
carries an instant reward value.
Of course, Room 5 loops back to itself with a reward of 100, and all other direct connections to
the goal room carry a reward of 100. In Q-learning, the goal is to reach the state with the
highest reward, so that if the agent arrives at the goal, it will remain there forever. This type of
goal is called an "absorbing goal".
Imagine our agent as a dumb virtual robot that can learn through experience. The agent can
pass from one room to another but has no knowledge of the environment, and doesn't know
which sequence of doors leads to the outside.
Suppose we want to model some kind of simple evacuation of an agent from any room in the
building. Now suppose we have an agent in Room 2 and we want the agent to learn to reach
outside the house (5).
Suppose the agent is in state 2. From state 2, it can go to state 3 because state 2 is connected
to 3. From state 2, however, the agent cannot directly go to state 1 because there is no direct
door connecting rooms 1 and 2 (thus, no arrows). From state 3, it can go either to state 1 or 4 or
back to 2 (look at all the arrows around state 3). If the agent is in state 4, then the three possible
actions are to go to state 0, 5 or 3. If the agent is in state 1, it can go either to state 5 or 3.
From state 0, it can only go back to state 4.
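If it helps to see these possible moves in code, here is a minimal Python sketch (the dictionary name actions is ours, not part of the original example); each state maps to the states it can reach through a door:

    # Possible moves (doors) from each state, as described above.
    actions = {
        0: [4],
        1: [3, 5],
        2: [3],
        3: [1, 2, 4],
        4: [0, 3, 5],
        5: [1, 4, 5],
    }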
We can put the state diagram and the instant reward values into a reward table, "matrix R"
(written out in the code sketch below).
The -1's in the table represent null values (i.e., where there isn't a link between nodes).
For example, State 0 cannot go to State 1.
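As a sketch of how matrix R might look in code, assuming NumPy, a reward of 100 on every door that leads directly to room 5, 0 on every other door, and -1 where there is no door:

    import numpy as np

    # Reward matrix R: rows are current states, columns are actions (next states).
    # -1 marks state pairs with no door between them.
    R = np.array([
        [-1, -1, -1, -1,  0,  -1],   # from room 0
        [-1, -1, -1,  0, -1, 100],   # from room 1
        [-1, -1, -1,  0, -1,  -1],   # from room 2
        [-1,  0,  0, -1,  0,  -1],   # from room 3
        [ 0, -1, -1,  0, -1, 100],   # from room 4
        [-1,  0, -1, -1,  0, 100],   # from room 5 (the goal)
    ])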
Now we'll add a similar matrix, "Q", to the brain of our agent, representing the memory of what
the agent has learned through experience. The rows of matrix Q represent the current state of
the agent, and the columns represent the possible actions leading to the next state (the links
between the nodes).
The agent starts out knowing nothing, so the matrix Q is initialized to zero. In this example, for
simplicity of explanation, we assume the number of states is known (to be six). If we didn't
know how many states were involved, the matrix Q could start out with only one element. It is
a simple task to add more columns and rows to matrix Q if a new state is found.
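In the code sketch, that all-zero starting memory is simply:

    # The agent's memory: one row per state, one column per action, all zeros at first.
    Q = np.zeros((6, 6))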
The transition rule of Q learning is a very simple formula:

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

According to this formula, the value assigned to a specific element of matrix Q is equal to the
sum of the corresponding value in matrix R and the learning parameter Gamma multiplied by
the maximum value of Q over all possible actions in the next state.
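Here is a minimal Python sketch of this transition rule, continuing the arrays above; the helper name update_q is ours, and the value 0.8 is the Gamma used in the worked episodes below:

    GAMMA = 0.8  # the learning parameter Gamma

    def update_q(state, action):
        # Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
        # In this example the action *is* the next state, so we look at that row of Q.
        # Impossible actions keep Q = 0, so taking the max over the whole row is safe here.
        next_state = action
        Q[state, action] = R[state, action] + GAMMA * Q[next_state].max()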
Our virtual agent will learn through experience, without a teacher (this kind of trial-and-error
learning from rewards is known as reinforcement learning). The agent will explore from state to
state until it reaches the goal. We'll call each exploration an episode. Each episode consists of
the agent moving from the initial state to the goal state. Each time the agent arrives at the goal
state, the program goes to the next episode.
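Putting the pieces together, one possible sketch of the whole procedure is the loop below. The number of episodes (1000) is an arbitrary choice, and the extra update taken from the goal state is a small addition of ours so that the goal row of Q, including the self-loop at state 5, is learned as well:

    import random

    GOAL = 5

    for episode in range(1000):                      # arbitrary number of episodes
        state = random.randint(0, 5)                 # random initial state
        while state != GOAL:                         # inner loop: move until the goal is reached
            action = random.choice(actions[state])   # pick one of the possible doors at random
            update_q(state, action)
            state = action                           # the chosen next state becomes the current state
        update_q(state, random.choice(actions[state]))  # also learn from the goal's own doors

With enough episodes, the entries of Q settle down to the converged values discussed later in the tutorial.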
For the first episode, let the learning parameter Gamma be 0.8 and the initial state be Room 1.
Look at the second row (state 1) of matrix R. There are two possible actions for the current
state 1: go to state 3, or go to state 5. By random selection, we select to go to state 5 as our action.
Now let's imagine what would happen if our agent were in state 5. Look at the sixth row of the
reward matrix R (i.e. state 5). It has 3 possible actions: go to state 1, 4 or 5.
Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100
Since matrix Q is still initialized to zero, Q(5, 1), Q(5, 4), and Q(5, 5) are all zero. The result of this
computation for Q(1, 5) is therefore 100, coming entirely from the instant reward R(1, 5).
The next state, 5, now becomes the current state. Because 5 is the goal state, we've finished
one episode. Our agent's brain now contains an updated matrix Q in which Q(1, 5) = 100 and
every other entry is still zero.
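We can check this first episode's arithmetic with the sketch above; Q_episode1 is a throwaway array standing in for the agent's memory at this point:

    Q_episode1 = np.zeros((6, 6))   # the all-zero memory at the start of the first episode
    # Q(1, 5) = R(1, 5) + 0.8 * Max of row 5 of Q = 100 + 0.8 * 0
    Q_episode1[1, 5] = R[1, 5] + GAMMA * Q_episode1[5].max()
    print(Q_episode1[1, 5])         # 100.0, matching the hand computation above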
For the next episode, we start with a randomly chosen initial state. This time, we have state 3
as our initial state.
Look at the fourth row of matrix R; it has 3 possible actions: go to state 1, 2 or 4. By random
selection, we select to go to state 1 as our action.
Now we imagine that we are in state 1. Look at the second row of reward matrix R (i.e. state
1). It has 2 possible actions: go to state 3 or state 5. Then, we compute the Q value:
Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * Max(0, 100) = 80
We use the matrix Q updated in the last episode: Q(1, 3) = 0 and Q(1, 5) = 100. The result
of the computation is Q(3, 1) = 80 because the instant reward R(3, 1) is zero. The matrix Q now
contains Q(1, 5) = 100 and Q(3, 1) = 80.
The next state, 1, now becomes the current state. We repeat the inner loop of the Q learning
algorithm because state 1 is not the goal state.
So, starting the new loop with the current state 1, there are two possible actions: go to state 3,
or go to state 5. By lucky draw, the action selected is 5.
Now, imagining we're in state 5, there are three possible actions: go to state 1, 4 or 5. We
compute the Q value using the maximum value of these possible actions.
Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100
The entries Q(5, 1), Q(5, 4), and Q(5, 5) of matrix Q are all still zero. The result of this
computation for Q(1, 5) is therefore 100, coming entirely from the instant reward R(1, 5). This
result does not change the Q matrix.
Because 5 is the goal state, we finish this episode. Our agent's brain still contains the matrix Q
with Q(1, 5) = 100 and Q(3, 1) = 80, and all other entries zero.
If our agent learns more through further episodes, the values in matrix Q will eventually
converge.
This matrix Q can then be normalized (i.e., converted to percentages) by dividing all non-zero
entries by the highest number (500 in this case).
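In the code sketch, this normalization is a one-liner (it assumes the Q matrix trained by the loop earlier):

    # Scale Q so its largest entry becomes 100, giving a percentage view of the same matrix.
    Q_normalized = Q / Q.max() * 100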
Once the matrix Q gets close enough to convergence, we know our agent has learned the
optimal paths to the goal state. Tracing the best sequence of states is as simple as following
the links with the highest values at each state.
For example, from initial State 2, the agent can use the matrix Q as a guide:
From State 2 the maximum Q value suggests the action of going to state 3.
From State 3 the maximum Q values suggest two alternatives: go to state 1 or 4.
Suppose we arbitrarily choose to go to 1.
From State 1 the maximum Q value suggests the action of going to state 5.
Thus the sequence is 2 - 3 - 1 - 5.
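As a final sketch, tracing a best path is just a greedy walk over the learned matrix Q. The helper name best_path is ours, and ties such as the one at state 3 are broken by whichever index NumPy's argmax returns first:

    def best_path(state, goal=5):
        # Follow the highest-valued link out of each state until the goal is reached.
        path = [state]
        while state != goal:
            state = int(np.argmax(Q[state]))
            path.append(state)
        return path

    print(best_path(2))   # [2, 3, 1, 5] with this tie-breaking; [2, 3, 4, 5] is equally good

Starting from state 2, this greedy walk reproduces the sequence 2 - 3 - 1 - 5 found above.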