
CSCE 5210

Fundamentals of Artificial Intelligence


Project 3
Final Report

Project: Reinforcement Learning using Bellman’s value iteration


Naga Sai Sivani, Tutika Ketha, Tirumuru
[email protected] [email protected]
11703058 11597873

Task T1

R1 = 10
R2 = -5
r = -5
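
The report lists only the reward settings for T1, so the following is a minimal, hypothetical Python sketch of Bellman value iteration for a 5x5 grid with these rewards. The destination and hazard positions, the discount factor GAMMA, the terminal handling of the hazard, and the stay-in-place rule at the grid border are illustrative assumptions, not details taken from the report.

import numpy as np

# Assumed 5x5 layout: the destination D and hazard H positions below are
# placeholders chosen for illustration and may differ from the actual grid.
ROWS, COLS = 5, 5
DEST = (0, 3)                      # assumed destination cell, reward R1
HAZARD = (2, 2)                    # assumed hazard cell, reward R2
R1, R2, r = 10, -5, -5             # Task T1 reward settings
GAMMA = 0.9                        # assumed discount factor (not stated in the report)
ACTIONS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def value_iteration(n_iter=100):
    """Deterministic-transition Bellman value iteration on the assumed grid."""
    V = np.zeros((ROWS, COLS))
    for _ in range(n_iter):
        new_V = np.zeros_like(V)
        for i in range(ROWS):
            for j in range(COLS):
                if (i, j) == DEST:          # terminal: destination keeps reward R1
                    new_V[i, j] = R1
                    continue
                if (i, j) == HAZARD:        # terminal: hazard keeps reward R2
                    new_V[i, j] = R2
                    continue
                q_values = []
                for di, dj in ACTIONS.values():
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < ROWS and 0 <= nj < COLS):
                        ni, nj = i, j       # moving into the border leaves the agent in place
                    q_values.append(r + GAMMA * V[ni, nj])
                new_V[i, j] = max(q_values)  # Bellman optimality update
        V = new_V
    return V

def greedy_policy(V):
    """Read one greedy action per non-terminal cell from the converged values."""
    policy = {}
    for i in range(ROWS):
        for j in range(COLS):
            if (i, j) in (DEST, HAZARD):
                continue
            def next_value(move):
                ni, nj = i + move[0], j + move[1]
                if not (0 <= ni < ROWS and 0 <= nj < COLS):
                    ni, nj = i, j
                return V[ni, nj]
            policy[(i, j)] = max(ACTIONS, key=lambda a: next_value(ACTIONS[a]))
    return policy

A greedy policy read off from V in this way is the kind of policy P1 that Task T2 evaluates below.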
Task T2

We evaluate policy P1 by examining the 19 possible starting points, focusing on whether a path exists from each starting point S to the destination D and whether that path is the shortest one.

Starting point (row, col) | Path exists | Shortest path length | Path length | Is shortest path | No. of paths | No. of shortest paths
(0,1) Yes 2 2 Yes 2 1
(0,2) Yes 1 1 Yes 2 1
(1,0) Yes 4 4 Yes 2 1
(1,1) Yes 3 3 Yes 2 2
(1,2) Yes 2 2 Yes 2 2
(1,3) Yes 1 1 Yes 1 1
(1,4) Yes 2 2 Yes 2 1
(2,0) Yes 5 5 Yes 3 2
(2,1) Yes 4 4 Yes 1 1
(2,3) Yes 2 2 Yes 1 1
(2,4) Yes 3 3 Yes 3 2
(3,0) Yes 6 6 Yes 2 2
(3,1) Yes 5 5 Yes 2 2
(3,2) Yes 4 4 Yes 2 1
(3,3) Yes 3 3 Yes 1 1
(3,4) Yes 4 4 Yes 2 2
(4,1) Yes 6 6 Yes 2 2
(4,2) Yes 5 5 Yes 3 2
(4,3) Yes 4 4 Yes 2 2
Totals: a path exists from 19/19 = 1 starting points; a shortest path is followed from 19/19 = 1 starting points; 37 paths in total, of which 29/37 = 0.78 are shortest.
Table 1: R1 = 10, R2 = -5, r = -5

Conclusion: Under policy P1, every starting point had a path of shortest distance to the destination. In total, 37 paths led from the starting points to the destination, of which 29 were shortest paths, so about 78% of the paths were optimal.
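
To make the table's columns concrete, below is a minimal, hypothetical Python sketch of how such statistics could be computed: enumerate every path a starting point can follow when all tied greedy actions of the policy are kept, then compare each path's length with the true shortest distance from a breadth-first search. The helper names and the tie-keeping convention are assumptions for illustration; the report's actual counting procedure is not shown and may differ.

from collections import deque

def shortest_distance(start, dest, blocked, rows=5, cols=5):
    """True shortest distance by BFS, treating blocked cells (e.g. the hazard) as walls."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (i, j), d = queue.popleft()
        if (i, j) == dest:
            return d
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (i + di, j + dj)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and nxt not in blocked and nxt not in seen):
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # no path exists

def enumerate_policy_paths(start, dest, greedy_moves, max_len=30):
    """greedy_moves maps each state to the list of tied best moves (row/col deltas)."""
    paths = []
    def walk(state, path):
        if state == dest:
            paths.append(path + [state])   # record the full state sequence
            return
        if len(path) > max_len:            # guard against cycles in a bad policy
            return
        for di, dj in greedy_moves.get(state, []):
            walk((state[0] + di, state[1] + dj), path + [state])
    walk(start, [])
    return paths

# Example with a hypothetical layout: count how many policy paths are shortest.
# paths = enumerate_policy_paths((0, 1), (0, 3), greedy_moves)
# dist = shortest_distance((0, 1), (0, 3), blocked={(2, 2)})
# n_shortest = sum(1 for p in paths if len(p) - 1 == dist)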
Task T3

1. Policy P2
R1 = 50
R2 = -50
r = -5

For T3 part 1, we ran value iteration using destination reward R1 = 50, hazard reward R2 = -50, and live-in reward r = -5. Policy P2 did show some changes relative to P1, but on closer inspection all of the changes appeared near the borders. The following table shows how effective P2 was compared with P1.

Starting point (row, col) | Path exists | Shortest path length | Path length | Is shortest path | No. of paths | No. of shortest paths
(0,1) Yes 2 2 Yes 2 1
(0,2) Yes 1 1 Yes 2 1
(1,0) Yes 4 4 Yes 1 1
(1,1) Yes 3 3 Yes 2 2
(1,2) Yes 2 2 Yes 2 2
(1,3) Yes 1 1 Yes 1 1
(1,4) Yes 2 2 Yes 1 1
(2,0) Yes 5 5 Yes 2 2
(2,1) Yes 4 4 Yes 2 1
(2,3) Yes 2 2 Yes 2 1
(2,4) Yes 3 3 Yes 2 2
(3,0) Yes 6 6 Yes 2 2
(3,1) Yes 5 5 Yes 3 2
(3,2) Yes 4 4 Yes 3 1
(3,3) Yes 3 3 Yes 2 1
(3,4) Yes 4 4 Yes 2 2
(4,1) Yes 6 6 Yes 2 2
(4,2) Yes 5 5 Yes 3 2
(4,3) Yes 4 4 Yes 1 1
Totals: a path exists from 19/19 = 1 starting points; a shortest path is followed from 19/19 = 1 starting points; 37 paths in total, of which 28/37 = 0.76 are shortest.
Table 2: R1 = 50, R2 = -50, r = -5

Policy P2 also has a path from every starting point, and a shortest path is available from each starting point to the destination. However, because some action directions changed, the number of paths from each starting point changed slightly.

Like P1, P2 had 37 paths in total, of which 28 were optimal (shortest distance). This gives 76% optimal paths, slightly less than P1's 78% but essentially the same.

Possible reason: This may be due to the larger negative hazard reward. Under the first policy, the hazard's influence was limited to the cells immediately bordering the hazard H, whereas now it propagates further into the grid.
2. Policy P3
R1 = 100
R2 = -500
r = -5

For T3 part 2, we ran value iteration using destination reward R1 = 100, hazard reward R2 = -500, and live-in reward r = -5. Policy P3 did show some changes, but on closer inspection all of the changes appeared near the borders. The following table shows how effective P3 was compared with P1 and P2.

Starting point (row, col) | Path exists | Shortest path length | Path length | Is shortest path | No. of paths | No. of shortest paths
(0,1) Yes 2 2 Yes 1 1
(0,2) Yes 1 1 Yes 2 1
(1,0) Yes 4 4 Yes 1 1
(1,1) Yes 3 3 Yes 3 2
(1,2) Yes 2 2 Yes 3 2
(1,3) Yes 1 1 Yes 1 1
(1,4) Yes 2 2 Yes 1 1
(2,0) Yes 5 5 Yes 2 2
(2,1) Yes 4 4 Yes 2 1
(2,3) Yes 2 2 Yes 2 1
(2,4) Yes 3 3 Yes 2 2
(3,0) Yes 6 6 Yes 1 1
(3,1) Yes 5 5 Yes 3 2
(3,2) Yes 4 6 No 2 0
(3,3) Yes 3 3 No 1 0
(3,4) Yes 4 4 Yes 1 1
(4,1) Yes 6 8 No 2 0
(4,2) Yes 5 5 No 2 0
(4,3) Yes 4 6 No 1 0
Totals: a path exists from 19/19 = 1 starting points; a shortest path is followed from 16/19 = 0.84 starting points; 33 paths in total, of which 19/33 = 0.58 are shortest.
Table 3: R1 = 100, R2 = -500, r = -5

Policy P3 also has a valid path from every starting point, but it fails to provide the shortest possible path from every starting point. It is therefore not as optimal as P1 and P2, although it still yields a valid policy that provides a shortest path from 84% of the starting points.

In total it provides 33 paths from start to destination, of which 58% are shortest paths, which is significantly lower than the other two policies P1 and P2.

Possible reason: the large gap between the magnitudes of the terminal rewards (exit reward R1 and hazard reward R2) and the magnitude of the live-in reward r. With R1 and R2 in the hundreds and the live-in reward only -5, the agent can afford to take its time, wandering in order to reach the large positive reward or to avoid the major hazard.
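
As a rough, hedged illustration of this effect (assuming a discount factor close to 1, which the report does not state): a detour of three extra steps costs only about 3 x 5 = 15 in accumulated live-in penalty, which is negligible next to the -500 hazard penalty it avoids, so long detours around the hazard become attractive and some policy paths stop being shortest.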
3. Policy P4

For each of the above tasks we obtained a policy with a path to the destination from every possible starting point. However, in terms of optimality we can rule out policy P3 compared with P1 and P2, since it failed to provide a shortest path from every starting point.

Combinations of R1 and R2 used before with r = -1

R1 = 10

R2 = -5

r = -1

Starting point (row, col) | Path exists | Shortest path length | Path length | Is shortest path | No. of paths | No. of shortest paths
(0,1) Yes 2 2 Yes 2 1
(0,2) Yes 1 1 Yes 2 1
(1,0) Yes 4 4 Yes 2 1
(1,1) Yes 3 3 Yes 2 2
(1,2) Yes 2 2 Yes 2 2
(1,3) Yes 1 1 Yes 1 1
(1,4) Yes 2 2 Yes 2 1
(2,0) Yes 5 5 Yes 3 2
(2,1) Yes 4 4 Yes 1 1
(2,3) Yes 2 2 Yes 1 1
(2,4) Yes 3 3 Yes 3 2
(3,0) Yes 6 6 Yes 2 2
(3,1) Yes 5 5 Yes 2 2
(3,2) Yes 4 4 Yes 2 1
(3,3) Yes 3 3 Yes 1 1
(3,4) Yes 4 4 Yes 1 1
(4,1) Yes 6 6 Yes 2 2
(4,2) Yes 5 5 Yes 3 2
(4,3) Yes 4 4 Yes 1 1
Totals: a path exists from 19/19 = 1 starting points; a shortest path is followed from 19/19 = 1 starting points; 35 paths in total, of which 27/35 = 0.77 are shortest.
Table 4: R1 = 10, R2 = -5, r = -1

With R1 = 10, R2 = -5, and live-in reward r = -1, we obtained a policy with shortest paths from all starting points to the destination D. It produced 35 paths in total, fewer than P1 and P2, of which 27 were optimal. The fraction of optimal paths was almost the same as for P1 and P2.

T3: Part 3 (continued)

R1 = 50

R2 = -50

r = -1

Starting point (row, col) | Path exists | Shortest path length | Path length | Is shortest path | No. of paths | No. of shortest paths
(0,1) Yes 2 2 Yes 2 1
(0,2) Yes 1 1 Yes 2 1
(1,0) Yes 4 4 Yes 1 1
(1,1) Yes 3 3 Yes 2 2
(1,2) Yes 2 2 Yes 2 2
(1,3) Yes 1 1 Yes 1 1
(1,4) Yes 2 2 Yes 1 1
(2,0) Yes 5 5 Yes 2 2
(2,1) Yes 4 4 Yes 2 1
(2,3) Yes 2 2 Yes 2 1
(2,4) Yes 3 3 Yes 2 2
(3,0) Yes 6 6 Yes 2 2
(3,1) Yes 5 5 Yes 3 2
(3,2) Yes 4 4 Yes 2 1
(3,3) Yes 3 3 Yes 2 1
(3,4) Yes 4 4 Yes 2 2
(4,1) Yes 6 6 Yes 2 2
(4,2) Yes 5 5 Yes 3 2
(4,3) Yes 4 4 Yes 1 1
Totals: a path exists from 19/19 = 1 starting points; a shortest path is followed from 19/19 = 1 starting points; 36 paths in total, of which 28/36 = 0.78 are shortest.
Table 5: R1 = 50, R2 = -50, r = -1

With R1 = 50, R2 = -50, and live-in reward r = -1 we obtained almost the same results: 36 paths, of which 28 (about 78%) were optimal. That is on par with the best results so far, differing from the other combinations by only one or two percentage points, and mostly because it produces fewer paths in total from the starting positions to the destination.

T3: Part 3 (continued)

R1 = 100, R2 = -500, r = -1
Starting point (row, col) | Path exists | Shortest path length | Path length | Is shortest path | No. of paths | No. of shortest paths
(0,1) Yes 2 2 Yes 1 1
(0,2) Yes 1 1 Yes 2 1
(1,0) Yes 4 4 Yes 1 1
(1,1) Yes 3 3 Yes 2 2
(1,2) Yes 2 2 Yes 2 2
(1,3) Yes 1 1 Yes 1 1
(1,4) Yes 2 2 Yes 1 1
(2,0) Yes 5 5 Yes 2 2
(2,1) Yes 4 4 Yes 3 1
(2,3) Yes 2 2 Yes 3 1
(2,4) Yes 3 3 Yes 2 2
(3,0) Yes 6 6 Yes 1 1
(3,1) Yes 5 5 Yes 4 2
(3,2) Yes 4 6 No 3 0
(3,3) Yes 3 3 Yes 2 1
(3,4) Yes 4 4 Yes 1 1
(4,1) Yes 6 8 No 2 0
(4,2) Yes 5 5 No 2 0
(4,3) Yes 4 6 No 1 0
Totals: a path exists from 19/19 = 1 starting points; a shortest path is followed from 15/19 = 0.79 starting points; 36 paths in total, of which 20/36 = 0.56 are shortest.
Table 6: R1 = 100, R2 = -500, r = -1

This run was not strictly necessary, since we had already ruled out this combination of R1 and R2, but it confirms that the policy finds a shortest path from only 79% of the starting positions (15/19) and that only 56% of the total paths are optimal.

Final conclusion T3:

The live-in rewards -5 and -1 gave almost the same results, but going strictly by the numbers we would prefer -1. According to our observations, however, the values of R1 and R2 affected the policy more than r did.
Task T4

Value iteration for multi-agent dynamic environment

To modify the value iteration function for a multi-agent environment, several factors must be considered regarding changes in the environment, the agents, and the new restrictions. These include, but are not limited to, the movement of the agents, the introduction of dynamic obstacles, the possibility of multiple destinations for multiple agents, and modifications to the calculation of Bellman's equation. The objective is to generalize the process so that we can still derive a policy and a way to visualize it. One approach is to introduce another probability variable, such as trans_prob, to account for the presence of obstacles in the next state for a given action. It is also crucial to define how each agent should navigate around obstacles according to its policy. However, the complexity of the problem increases significantly with the number of actions, since we now also need to account for the number of agents and their potential destinations.

In real-world scenarios, the environment may be extensive, and it may be impractical for a single solution, such as one fixed policy, to address all situations effectively. It is therefore essential to be able to apply value iteration on the fly for each agent separately in order to adapt to dynamic scenarios.

Possible Pseudo Code

Function value_iteration_multi_agent_environment(environment, R1, R2, r, agents, actions, states):
    Initialize:
        Value function V(agent, s) for each agent and state s
        Policy π(agent, s) for each agent and state s
        Transition probabilities P(agent, s' | s, a) for each agent and state-action pair (s, a)
    Repeat until convergence or maximum iterations:
        For each agent in agents:
            Repeat until convergence or maximum iterations:
                For each state s in the environment:
                    For each action a:
                        Calculate the expected value of taking action a from state s:
                            Q(agent, s, a) = Σ over s' of P(agent, s' | s, a) * (R(agent, s, a, s') + γ * V(agent, s'))
                    Update the value function for state s for the current agent:
                        V(agent, s) = max over all actions a of Q(agent, s, a)
                    Update the policy for state s for the current agent:
                        π(agent, s) = argmax over all actions a of Q(agent, s, a)
    Return (value_functions, policies) for all agents
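
As a complement to the pseudocode, the following is a hedged, runnable Python sketch of one possible realization of the idea: every agent treats the other agents' current cells as temporary obstacles and re-runs its own value iteration on a copy of the grid before each synchronous move. The function names, obstacle handling, discount factor, and agent representation are assumptions made for illustration; they are not part of the report.

import numpy as np

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def value_iteration_single(rows, cols, dest, obstacles, R1, R2, r,
                           gamma=0.9, n_iter=100):
    """Value iteration for one agent, treating other agents' cells as obstacles."""
    V = np.zeros((rows, cols))
    for _ in range(n_iter):
        new_V = np.zeros((rows, cols))
        for i in range(rows):
            for j in range(cols):
                if (i, j) == dest:              # terminal: this agent's destination
                    new_V[i, j] = R1
                    continue
                if (i, j) in obstacles:         # occupied/hazard cell gets the penalty
                    new_V[i, j] = R2
                    continue
                q = []
                for di, dj in ACTIONS:
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < rows and 0 <= nj < cols) or (ni, nj) in obstacles:
                        ni, nj = i, j           # blocked move: stay in place
                    q.append(r + gamma * V[ni, nj])
                new_V[i, j] = max(q)            # Bellman optimality update
        V = new_V
    return V

def multi_agent_step(rows, cols, agents, R1, R2, r):
    """agents: dict name -> {'pos': (i, j), 'dest': (i, j)}. One synchronous step."""
    moves = {}
    for name, info in agents.items():
        # Other agents' current cells act as dynamic obstacles for this agent.
        obstacles = {a["pos"] for n, a in agents.items() if n != name}
        V = value_iteration_single(rows, cols, info["dest"], obstacles, R1, R2, r)
        i, j = info["pos"]
        best, best_val = (i, j), V[i, j]
        for di, dj in ACTIONS:
            ni, nj = i + di, j + dj
            if (0 <= ni < rows and 0 <= nj < cols
                    and (ni, nj) not in obstacles and V[ni, nj] > best_val):
                best, best_val = (ni, nj), V[ni, nj]
        moves[name] = best                      # greedy move under this agent's values
    for name, pos in moves.items():
        agents[name]["pos"] = pos
    return agents

# Example usage with two hypothetical agents on a 5x5 grid:
# agents = {"A": {"pos": (4, 0), "dest": (0, 4)},
#           "B": {"pos": (4, 4), "dest": (0, 0)}}
# agents = multi_agent_step(5, 5, agents, R1=10, R2=-5, r=-1)

One limitation of this simple scheme is that two agents can still pick the same target cell in the same step; a practical implementation would add a tie-breaking or cell-reservation rule, which is exactly the kind of extra restriction the discussion above anticipates.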
