Symbolic AI MDP DP
For this assignment you will solve a similar maze as in the previous assignment, but this time
using a dynamic programming approach.
• The coding assignment will be in Python 3. Verify that Python 3 is installed on your operating system; otherwise, install it.
• We will also need a specific package: NumPy, short for Numerical Python. NumPy is an important Python package for vector and matrix operations. For the assignments, you will mostly need to understand indexing and slicing into value and state-action value arrays (a short example follows this list).
• You can install a variety of scientific Python packages at once by installing the Individual edition of the Anaconda distribution: https://www.anaconda.com/products/individual. This will also give you a Python editor: Spyder. However, feel free to manually install Python 3, NumPy, and an IDE of your own choice.
• If you are unfamiliar with Python and NumPy, you may take a quick online tutorial for both. Do not spend too much time here: you will only need a few basic concepts, which you can also look up during the assignment.
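A small illustrative example of the kind of NumPy indexing and slicing you will need (the sizes and numbers below are made up and not part of the assignment):

import numpy as np

n_states, n_actions = 64, 4              # sizes as in the provided example world
V_s = np.zeros(n_states)                 # value table: one entry per state
Q_sa = np.zeros((n_states, n_actions))   # state-action value table

V_s[12] = 3.5                            # indexing: the value of state 12
Q_sa[12, 2] = -1.0                       # the value of action 2 in state 12
row = Q_sa[12, :]                        # slicing: all action values of state 12
print(np.max(row), np.argmax(row))       # best value, and an index achieving it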
Handing in the assignment
You need to submit three files, where you replace groupnr with your group number:
• dynamic_programming_groupnr.py, your modified version of dynamic_programming.py
with the relevant answers. Be sure to check whether your solution runs from the command
line.
• prison_groupnr.txt, your modified version of prison.txt for the relevant exercise.
• answers_groupnr.pdf, with your answers to the open questions in the assignment.
1 Coding Exercises
In these exercises you will implement Value iteration (VI) and Q-value iteration (QI), two variants
of Dynamic Programming, to solve a given Markov Decision Process. We will first explain the
environment (the MDP definition), and the starting point for your algorithm implementation.
• Explanation of MDP:
– State space: The state is represented as an index. For the provided example, there are
64 unique states, since there are 16 free locations, and two keys. In each location, we
can hold or not hold either key, which gives rise to 16 · 2 · 2 possible states. These are
simply numbered 0-63. If you want to know what situation a particular state actually
represents, you can call the method World.print_state(state); see below.
– Action space: In every state, the agent has four possible actions: {up, down, left,
right}.
– Dynamics: When the agent moves into a wall, it just remains at the same position.
The agent automatically picks up a key when stepping on the specific location, and
automatically opens the door when stepping on it while holding the specific key.
– Reward: The reward at every transition is −1, except when we reach a goal: the reward is then equal to 10 times the numeric element at the goal. So, if the numeric element at the goal is 3, the reward at that goal equals 30.
– Gamma: We assume γ = 1.0 throughout the experiments.
• Attributes: A World object has a few important attributes:
– states returns a list of all states. When you initialize a map, all possible configurations of agent location and key possession are automatically inferred for you, and each possible combination is assigned a unique state index.
– n_states returns the total number of states (a scalar).
– actions returns a list of all possible actions.
– n_actions returns the total number of actions (a scalar).
– terminal indicates whether the agent has reached a goal (task terminates).
• Methods: A World object has several important methods.
– transition_function(s,a) computes, for a given state and action, the next state
s_prime and reward r. It does not affect the agent location!
– act(a) executes action a, i.e., it calls transition_function() and then actually
moves the agent. It also checks for termination.
– reset_agent() resets the agent to the start location, as given in the initial map. Also
sets the terminal attribute to False.
– get_current_state() returns the current state of the environment.
– print_state(s) prints a description of the situation that a particular discrete state actually represents.
– print_map() prints the current map of the environment.
When you execute the world.py script from the command line, Python will run the code below if __name__ == '__main__':, found at the bottom of the file. This gives some examples of the above methods. Play around a little to familiarize yourself with the environment.
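For illustration, a minimal interaction sketch using the attributes and methods listed above. Note that the constructor call and the tuple return of transition_function are assumptions; check the code at the bottom of world.py for the exact usage.

from world import World

env = World('prison.txt')          # assumption: the map file is passed to the constructor
env.print_map()                    # show the current map
print(env.n_states, env.n_actions) # 64 states and 4 actions for the provided example

s = env.get_current_state()        # discrete index of the current state
env.print_state(s)                 # human-readable description of that state

a = env.actions[0]                 # pick some available action
s_prime, r = env.transition_function(s, a)  # inspect a transition; does not move the agent
env.act(a)                         # actually move the agent (and check for termination)
print(env.terminal)                # True once a goal has been reached
env.reset_agent()                  # back to the start location; terminal becomes False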
1.2 The algorithm
For the exercises, you will implement two dynamic programming algorithms in the environment described above. You should use the dynamic_programming.py file, which contains the DynamicProgramming() class.
1.3 Exercise: Dynamic Programming (coding)
Start by executing dynamic_programming.py. This executes the code under if __name__ == '__main__': at the bottom of the script. You can manually execute a policy and familiarize yourself with the environment.
1. Value iteration:
a Implement value iteration in the DynamicProgramming.value_iteration() method (a generic sketch of the underlying Bellman backup follows this exercise list). Do not change the function arguments or return statements. A start value table is already provided for you: V_s = np.zeros(env.n_states). Your function should compute the optimal value function and, at the end of the function, store the optimal value table in self.V_s. Include a print statement that prints the error in each iteration of your algorithm.
b Implement DynamicProgramming.execute_policy() to execute the greedy policy based on the value table V(s). You only need to implement the code segment below if table == 'V' and self.V_s is not None:, which should set the greedy_action variable to the greedy action (or one of the greedy actions) in the current state.
c Check whether your implementation works. Does the agent follow the optimal policy during execution?
2. Q-value iteration
a Implement Q-value iteration in the DynamicProgramming.Q_value_iteration() method (the sketch after this exercise list also shows the corresponding backup for Q(s, a)). Do not change the function arguments or return statements. A start state-action value table is already provided for you: Q_sa = np.zeros((env.n_states, env.n_actions)). Your function should compute the optimal state-action value function and, at the end of the function, store the optimal state-action value table in self.Q_sa.
b Implement DynamicProgramming.execute_policy() to execute the greedy policy based on the state-action value table Q(s, a). You only need to implement the code segment below elif table == 'Q' and self.Q_sa is not None:, which should set the greedy_action variable to the greedy action (or one of the greedy actions) in the current state.
c Check whether your implementation works. Does the agent follow the optimal policy during execution?
3. Multiple goals
a The provided prison.txt only has a single goal. Adapt the prison so that it has two goals, or build a new maze with two goals; each goal must be reachable from the start location. Design your maze such that, depending on the starting location, the optimal policy picks a different goal. Note that each goal is a terminal state.
b Run value iteration or Q-value iteration on your new environment, and describe the
observed agent behaviour.
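For reference for exercises 1 and 2, the core of both algorithms is the Bellman optimality backup. The sketch below is illustrative only and not the required implementation: it assumes a deterministic env.transition_function(s, a) returning (s_prime, r) and γ = 1.0, and it leaves the sweep over states, the handling of terminal (goal) states, and the convergence check to you.

import numpy as np

gamma = 1.0  # as assumed throughout the assignment

def backup_V(env, V_s, s):
    # V(s) <- max_a [ r(s, a) + gamma * V(s') ]
    return max(r + gamma * V_s[s_prime]
               for s_prime, r in (env.transition_function(s, a) for a in env.actions))

def backup_Q(env, Q_sa, s, a):
    # Q(s, a) <- r(s, a) + gamma * max_a' Q(s', a')
    s_prime, r = env.transition_function(s, a)
    return r + gamma * np.max(Q_sa[s_prime, :])

def greedy_action_from_V(env, V_s, s):
    # greedy policy: argmax_a [ r(s, a) + gamma * V(s') ]
    values = [r + gamma * V_s[s_prime]
              for s_prime, r in (env.transition_function(s, a) for a in env.actions)]
    return env.actions[int(np.argmax(values))]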
2 Reflection Exercises
4. Reflection on Dynamic Programming:
When you successfully implemented DP, you saw that it solves the problem very fast. However, the problem to which we applied it was quite small. Imagine we have a world of size 100×100, which can have 10,000 free agent locations, and imagine this more complex world has 30 keys and doors.
a How many unique states does this new problem have? (Note: you should count every
possible combination of agent location and key possession)
b Imagine we use 32-bit floating-point numbers to store the values in the table, i.e., every value estimate takes 32 bits, or 4 bytes, of memory. How much memory would we roughly need to store the value table for this new problem?
c Roughly how long would it take to solve this problem on your laptop? Explain your answer.
d Explain the curse of dimensionality. What aspect of our problem definition causes the
exponential growth?
5. Comparison to search:
We may also compare Dynamic Programming to the search approaches you have previously
encountered.
Imagine we apply an iterative deepening tree search (i.e., no graph search, so we do not detect whether we have already encountered a state, but simply expand the tree in all directions) to the example problem in prison.txt.
a Estimate the time complexity of an iterative deepening tree search on the prison.txt
problem (hint: first compute the depth of the shortest path towards the goal).
b Compare the time complexity of iterative deepening tree search to the time complexity you empirically observed for dynamic programming on the prison problem. Which approach is faster?
c Compare the way Dynamic Programming stores the solution to the way tree/graph search
approaches store the solution. What could be a benefit of the DP representation, and
what could be a benefit of the tree/graph search representation?