
CSCE 5210

Fundamentals of Artificial Intelligence


Project 3
Final Report

Project: Reinforcement Learning using Bellman’s value iteration


Naga Sai Sivani, Tutika Ketha, Tirumuru
[email protected] [email protected]
11703058 11597873

Task T1

R1 = 10
R2 = -5
r = -5
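
The report lists only the reward settings for T1, so the following is a minimal, hypothetical Python sketch of Bellman value iteration for a 5x5 grid with these rewards. The destination and hazard positions, the discount factor GAMMA, the terminal handling of the hazard, and the stay-in-place rule at the grid border are illustrative assumptions, not details taken from the report.

import numpy as np

# Assumed 5x5 layout: the destination D and hazard H positions below are
# placeholders chosen for illustration and may differ from the actual grid.
ROWS, COLS = 5, 5
DEST = (0, 3)                      # assumed destination cell, reward R1
HAZARD = (2, 2)                    # assumed hazard cell, reward R2
R1, R2, r = 10, -5, -5             # Task T1 reward settings
GAMMA = 0.9                        # assumed discount factor (not stated in the report)
ACTIONS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def value_iteration(n_iter=100):
    """Deterministic-transition Bellman value iteration on the assumed grid."""
    V = np.zeros((ROWS, COLS))
    for _ in range(n_iter):
        new_V = np.zeros_like(V)
        for i in range(ROWS):
            for j in range(COLS):
                if (i, j) == DEST:          # terminal: destination keeps reward R1
                    new_V[i, j] = R1
                    continue
                if (i, j) == HAZARD:        # terminal: hazard keeps reward R2
                    new_V[i, j] = R2
                    continue
                q_values = []
                for di, dj in ACTIONS.values():
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < ROWS and 0 <= nj < COLS):
                        ni, nj = i, j       # moving into the border leaves the agent in place
                    q_values.append(r + GAMMA * V[ni, nj])
                new_V[i, j] = max(q_values)  # Bellman optimality update
        V = new_V
    return V

def greedy_policy(V):
    """Read one greedy action per non-terminal cell from the converged values."""
    policy = {}
    for i in range(ROWS):
        for j in range(COLS):
            if (i, j) in (DEST, HAZARD):
                continue
            def next_value(move):
                ni, nj = i + move[0], j + move[1]
                if not (0 <= ni < ROWS and 0 <= nj < COLS):
                    ni, nj = i, j
                return V[ni, nj]
            policy[(i, j)] = max(ACTIONS, key=lambda a: next_value(ACTIONS[a]))
    return policy

A greedy policy read off from V in this way is the kind of policy P1 that Task T2 evaluates below.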
Task T2

We evaluate policy P1 by examining the 19 possible starting points, focusing on whether a path exists from each starting point S to the destination D and whether that path is the shortest one.

Starting point (row, col) | Path exists | Shortest path length | Path length | Is shortest path | No. of paths | No. of shortest paths
(0,1) Yes 2 2 Yes 2 1
(0,2) Yes 1 1 Yes 2 1
(1,0) Yes 4 4 Yes 2 1
(1,1) Yes 3 3 Yes 2 2
(1,2) Yes 2 2 Yes 2 2
(1,3) Yes 1 1 Yes 1 1
(1,4) Yes 2 2 Yes 2 1
(2,0) Yes 5 5 Yes 3 2
(2,1) Yes 4 4 Yes 1 1
(2,3) Yes 2 2 Yes 1 1
(2,4) Yes 3 3 Yes 3 2
(3,0) Yes 6 6 Yes 2 2
(3,1) Yes 5 5 Yes 2 2
(3,2) Yes 4 4 Yes 2 1
(3,3) Yes 3 3 Yes 1 1
(3,4) Yes 4 4 Yes 2 2
(4,1) Yes 6 6 Yes 2 2
(4,2) Yes 5 5 Yes 3 2
(4,3) Yes 4 4 Yes 2 2
Totals: a path exists from 19/19 = 1 starting points; a shortest path is followed from 19/19 = 1 starting points; 37 paths in total, of which 29/37 = 0.78 are shortest.
Table 1: R1 = 10, R2 = -5, r = -5

Conclusion: Under policy P1, every starting point had a path of shortest distance to the destination. In total, 37 paths led from the starting points to the destination, of which 29 were shortest paths, so about 78% of the paths were optimal.
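
To make the table's columns concrete, below is a minimal, hypothetical Python sketch of how such statistics could be computed: enumerate every path a starting point can follow when all tied greedy actions of the policy are kept, then compare each path's length with the true shortest distance from a breadth-first search. The helper names and the tie-keeping convention are assumptions for illustration; the report's actual counting procedure is not shown and may differ.

from collections import deque

def shortest_distance(start, dest, blocked, rows=5, cols=5):
    """True shortest distance by BFS, treating blocked cells (e.g. the hazard) as walls."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (i, j), d = queue.popleft()
        if (i, j) == dest:
            return d
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (i + di, j + dj)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and nxt not in blocked and nxt not in seen):
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # no path exists

def enumerate_policy_paths(start, dest, greedy_moves, max_len=30):
    """greedy_moves maps each state to the list of tied best moves (row/col deltas)."""
    paths = []
    def walk(state, path):
        if state == dest:
            paths.append(path + [state])   # record the full state sequence
            return
        if len(path) > max_len:            # guard against cycles in a bad policy
            return
        for di, dj in greedy_moves.get(state, []):
            walk((state[0] + di, state[1] + dj), path + [state])
    walk(start, [])
    return paths

# Example with a hypothetical layout: count how many policy paths are shortest.
# paths = enumerate_policy_paths((0, 1), (0, 3), greedy_moves)
# dist = shortest_distance((0, 1), (0, 3), blocked={(2, 2)})
# n_shortest = sum(1 for p in paths if len(p) - 1 == dist)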
Task T3

1. Policy P2
R1 = 50
R2 = -50
r = -5

For T3 part 1, we ran value iteration using destination reward R1 = 50, hazard reward R2 = -50, and live-in reward r = -5. Policy P2 did show some changes relative to P1, but on closer inspection all of the changes appeared near the borders. The following table shows how effective P2 was compared with P1.

Starting point (row, col) | Path exists | Shortest path length | Path length | Is shortest path | No. of paths | No. of shortest paths
(0,1) Yes 2 2 Yes 2 1
(0,2) Yes 1 1 Yes 2 1
(1,0) Yes 4 4 Yes 1 1
(1,1) Yes 3 3 Yes 2 2
(1,2) Yes 2 2 Yes 2 2
(1,3) Yes 1 1 Yes 1 1
(1,4) Yes 2 2 Yes 1 1
(2,0) Yes 5 5 Yes 2 2
(2,1) Yes 4 4 Yes 2 1
(2,3) Yes 2 2 Yes 2 1
(2,4) Yes 3 3 Yes 2 2
(3,0) Yes 6 6 Yes 2 2
(3,1) Yes 5 5 Yes 3 2
(3,2) Yes 4 4 Yes 3 1
(3,3) Yes 3 3 Yes 2 1
(3,4) Yes 4 4 Yes 2 2
(4,1) Yes 6 6 Yes 2 2
(4,2) Yes 5 5 Yes 3 2
(4,3) Yes 4 4 Yes 1 1
Totals: a path exists from 19/19 = 1 starting points; a shortest path is followed from 19/19 = 1 starting points; 37 paths in total, of which 28/37 = 0.76 are shortest.
Table 2: R1 = 50, R2 = -50, r = -5

Policy P2 also has a path from every starting point, and a shortest path is available from each starting point to the destination. However, because some action directions changed, the number of paths from each starting point changed slightly.

Like P1, P2 had 37 paths in total, of which 28 were optimal (shortest distance). This gives 76% optimal paths, slightly less than P1's 78% but essentially the same.

Possible reason: This may be due to the larger negative hazard reward. Under the first policy, the hazard's influence was limited to the cells immediately bordering the hazard H, whereas now it propagates further into the grid.
2. Policy P3
R1 = 100
R2 = -500
r = -5

For T3 part 2, we ran value iteration using destination reward R1 = 100, hazard reward R2 = -500, and live-in reward r = -5. Policy P3 did show some changes, but on closer inspection all of the changes appeared near the borders. The following table shows how effective P3 was compared with P1 and P2.

Starting point (row, col) | Path exists | Shortest path length | Path length | Is shortest path | No. of paths | No. of shortest paths
(0,1) Yes 2 2 Yes 1 1
(0,2) Yes 1 1 Yes 2 1
(1,0) Yes 4 4 Yes 1 1
(1,1) Yes 3 3 Yes 3 2
(1,2) Yes 2 2 Yes 3 2
(1,3) Yes 1 1 Yes 1 1
(1,4) Yes 2 2 Yes 1 1
(2,0) Yes 5 5 Yes 2 2
(2,1) Yes 4 4 Yes 2 1
(2,3) Yes 2 2 Yes 2 1
(2,4) Yes 3 3 Yes 2 2
(3,0) Yes 6 6 Yes 1 1
(3,1) Yes 5 5 Yes 3 2
(3,2) Yes 4 6 No 2 0
(3,3) Yes 3 3 No 1 0
(3,4) Yes 4 4 Yes 1 1
(4,1) Yes 6 8 No 2 0
(4,2) Yes 5 5 No 2 0
(4,3) Yes 4 6 No 1 0
Totals: a path exists from 19/19 = 1 starting points; a shortest path is followed from 16/19 = 0.84 starting points; 33 paths in total, of which 19/33 = 0.58 are shortest.
Table 3: R1 = 100, R2 = -500, r = -5

Policy P3 also has a valid path from every starting point, but it fails to provide the shortest possible path from every starting point. It is therefore not as optimal as P1 and P2, although it still yields a valid policy that provides a shortest path from 84% of the starting points.

In total it provides 33 paths from start to destination, of which 58% are shortest paths, which is significantly lower than the other two policies P1 and P2.

Possible reason: the large gap between the magnitudes of the terminal rewards (exit reward R1 and hazard reward R2) and the magnitude of the live-in reward r. With R1 and R2 in the hundreds and the live-in reward only -5, the agent can afford to take its time, wandering in order to reach the large positive reward or to avoid the major hazard.
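
As a rough, hedged illustration of this effect (assuming a discount factor close to 1, which the report does not state): a detour of three extra steps costs only about 3 x 5 = 15 in accumulated live-in penalty, which is negligible next to the -500 hazard penalty it avoids, so long detours around the hazard become attractive and some policy paths stop being shortest.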
3. Policy P4

For each of the above tasks we obtained a policy with a path to the destination from every possible starting point. However, in terms of optimality we can rule out policy P3 compared with P1 and P2, since it failed to provide a shortest path from every starting point.

Combinations of R1 and R2 used before with r = -1

R1 = 10

R2 = -5

r = -1

Starting point (row, col) | Path exists | Shortest path length | Path length | Is shortest path | No. of paths | No. of shortest paths
(0,1) Yes 2 2 Yes 2 1
(0,2) Yes 1 1 Yes 2 1
(1,0) Yes 4 4 Yes 2 1
(1,1) Yes 3 3 Yes 2 2
(1,2) Yes 2 2 Yes 2 2
(1,3) Yes 1 1 Yes 1 1
(1,4) Yes 2 2 Yes 2 1
(2,0) Yes 5 5 Yes 3 2
(2,1) Yes 4 4 Yes 1 1
(2,3) Yes 2 2 Yes 1 1
(2,4) Yes 3 3 Yes 3 2
(3,0) Yes 6 6 Yes 2 2
(3,1) Yes 5 5 Yes 2 2
(3,2) Yes 4 4 Yes 2 1
(3,3) Yes 3 3 Yes 1 1
(3,4) Yes 4 4 Yes 1 1
(4,1) Yes 6 6 Yes 2 2
(4,2) Yes 5 5 Yes 3 2
(4,3) Yes 4 4 Yes 1 1
Totals: a path exists from 19/19 = 1 starting points; a shortest path is followed from 19/19 = 1 starting points; 35 paths in total, of which 27/35 = 0.77 are shortest.
Table 4: R1 = 10, R2 = -5, r = -1

With R1 = 10, R2 = -5, and live-in reward r = -1, we obtained a policy with shortest paths from all starting points to the destination D. It produced 35 paths in total, fewer than P1 and P2, of which 27 were optimal. The fraction of optimal paths was almost the same as for P1 and P2.

T3: Part 3 (continued)

R1 = 50

R2 = -50

r = -1

Starting point (row, col) | Path exists | Shortest path length | Path length | Is shortest path | No. of paths | No. of shortest paths
(0,1) Yes 2 2 Yes 2 1
(0,2) Yes 1 1 Yes 2 1
(1,0) Yes 4 4 Yes 1 1
(1,1) Yes 3 3 Yes 2 2
(1,2) Yes 2 2 Yes 2 2
(1,3) Yes 1 1 Yes 1 1
(1,4) Yes 2 2 Yes 1 1
(2,0) Yes 5 5 Yes 2 2
(2,1) Yes 4 4 Yes 2 1
(2,3) Yes 2 2 Yes 2 1
(2,4) Yes 3 3 Yes 2 2
(3,0) Yes 6 6 Yes 2 2
(3,1) Yes 5 5 Yes 3 2
(3,2) Yes 4 4 Yes 2 1
(3,3) Yes 3 3 Yes 2 1
(3,4) Yes 4 4 Yes 2 2
(4,1) Yes 6 6 Yes 2 2
(4,2) Yes 5 5 Yes 3 2
(4,3) Yes 4 4 Yes 1 1
Totals: a path exists from 19/19 = 1 starting points; a shortest path is followed from 19/19 = 1 starting points; 36 paths in total, of which 28/36 = 0.78 are shortest.
Table 5: R1 = 50, R2 = -50, r = -1

With R1 = 50, R2 = -50, and live-in reward r = -1 we obtained almost the same results: 36 paths, of which 28 (about 78%) were optimal. That is on par with the best results so far, differing from the other combinations by only one or two percentage points, and mostly because it produces fewer paths in total from the starting positions to the destination.

T3: Part 3 (continued)

R1 = 100, R2 = -500, r = -1
Starting point (row, col) | Path exists | Shortest path length | Path length | Is shortest path | No. of paths | No. of shortest paths
(0,1) Yes 2 2 Yes 1 1
(0,2) Yes 1 1 Yes 2 1
(1,0) Yes 4 4 Yes 1 1
(1,1) Yes 3 3 Yes 2 2
(1,2) Yes 2 2 Yes 2 2
(1,3) Yes 1 1 Yes 1 1
(1,4) Yes 2 2 Yes 1 1
(2,0) Yes 5 5 Yes 2 2
(2,1) Yes 4 4 Yes 3 1
(2,3) Yes 2 2 Yes 3 1
(2,4) Yes 3 3 Yes 2 2
(3,0) Yes 6 6 Yes 1 1
(3,1) Yes 5 5 Yes 4 2
(3,2) Yes 4 6 No 3 0
(3,3) Yes 3 3 Yes 2 1
(3,4) Yes 4 4 Yes 1 1
(4,1) Yes 6 8 No 2 0
(4,2) Yes 5 5 No 2 0
(4,3) Yes 4 6 No 1 0
Totals: a path exists from 19/19 = 1 starting points; a shortest path is followed from 15/19 = 0.79 starting points; 36 paths in total, of which 20/36 = 0.56 are shortest.
Table 6: R1 = 100, R2 = -500, r = -1

This run was not strictly necessary, since we had already ruled out this combination of R1 and R2, but it confirms that the policy finds a shortest path from only 79% of the starting positions (15/19) and that only 56% of the total paths are optimal.

Final conclusion T3:

The live-in rewards -5 and -1 gave almost the same results, but going strictly by the numbers we would prefer -1. According to our observations, however, the values of R1 and R2 affected the policy more than r did.
Task T4

Value iteration for multi-agent dynamic environment

To modify the value iteration function for a multi-agent environment, several factors must be considered regarding changes in the environment, the agents, and the new restrictions. These include, but are not limited to, the movement of the agents, the introduction of dynamic obstacles, the possibility of multiple destinations for multiple agents, and modifications to the calculation of Bellman's equation. The objective is to generalize the process so that we can still derive a policy and a way to visualize it. One approach is to introduce another probability variable, such as trans_prob, to account for the presence of obstacles in the next state for a given action. It is also crucial to define how each agent should navigate around obstacles according to its policy. However, the complexity of the problem increases significantly with the number of actions, since we now also need to account for the number of agents and their potential destinations.

In real-world scenarios, the environment may be extensive, and it may be impractical for a single solution, such as one fixed policy, to address all situations effectively. It is therefore essential to be able to apply value iteration on the fly for each agent separately in order to adapt to dynamic scenarios.

Possible Pseudo Code

Function value_iteration_multi_agent_environment(environment, R1, R2, r, agents, actions, states):
    Initialize:
        Value function V(agent, s) for each agent and state s
        Policy π(agent, s) for each agent and state s
        Transition probabilities P(agent, s' | s, a) for each agent and state-action pair (s, a)
    Repeat until convergence or maximum iterations:
        For each agent in agents:
            Repeat until convergence or maximum iterations:
                For each state s in the environment:
                    For each action a:
                        Calculate the expected value of taking action a from state s:
                            Q(agent, s, a) = Σ over s' of P(agent, s' | s, a) * (R(agent, s, a, s') + γ * V(agent, s'))
                    Update the value function for state s for the current agent:
                        V(agent, s) = max over all actions a of Q(agent, s, a)
                    Update the policy for state s for the current agent:
                        π(agent, s) = argmax over all actions a of Q(agent, s, a)
    Return (value_functions, policies) for all agents
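
As a complement to the pseudocode, the following is a hedged, runnable Python sketch of one possible realization of the idea: every agent treats the other agents' current cells as temporary obstacles and re-runs its own value iteration on a copy of the grid before each synchronous move. The function names, obstacle handling, discount factor, and agent representation are assumptions made for illustration; they are not part of the report.

import numpy as np

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def value_iteration_single(rows, cols, dest, obstacles, R1, R2, r,
                           gamma=0.9, n_iter=100):
    """Value iteration for one agent, treating other agents' cells as obstacles."""
    V = np.zeros((rows, cols))
    for _ in range(n_iter):
        new_V = np.zeros((rows, cols))
        for i in range(rows):
            for j in range(cols):
                if (i, j) == dest:              # terminal: this agent's destination
                    new_V[i, j] = R1
                    continue
                if (i, j) in obstacles:         # occupied/hazard cell gets the penalty
                    new_V[i, j] = R2
                    continue
                q = []
                for di, dj in ACTIONS:
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < rows and 0 <= nj < cols) or (ni, nj) in obstacles:
                        ni, nj = i, j           # blocked move: stay in place
                    q.append(r + gamma * V[ni, nj])
                new_V[i, j] = max(q)            # Bellman optimality update
        V = new_V
    return V

def multi_agent_step(rows, cols, agents, R1, R2, r):
    """agents: dict name -> {'pos': (i, j), 'dest': (i, j)}. One synchronous step."""
    moves = {}
    for name, info in agents.items():
        # Other agents' current cells act as dynamic obstacles for this agent.
        obstacles = {a["pos"] for n, a in agents.items() if n != name}
        V = value_iteration_single(rows, cols, info["dest"], obstacles, R1, R2, r)
        i, j = info["pos"]
        best, best_val = (i, j), V[i, j]
        for di, dj in ACTIONS:
            ni, nj = i + di, j + dj
            if (0 <= ni < rows and 0 <= nj < cols
                    and (ni, nj) not in obstacles and V[ni, nj] > best_val):
                best, best_val = (ni, nj), V[ni, nj]
        moves[name] = best                      # greedy move under this agent's values
    for name, pos in moves.items():
        agents[name]["pos"] = pos
    return agents

# Example usage with two hypothetical agents on a 5x5 grid:
# agents = {"A": {"pos": (4, 0), "dest": (0, 4)},
#           "B": {"pos": (4, 4), "dest": (0, 0)}}
# agents = multi_agent_step(5, 5, agents, R1=10, R2=-5, r=-1)

One limitation of this simple scheme is that two agents can still pick the same target cell in the same step; a practical implementation would add a tie-breaking or cell-reservation rule, which is exactly the kind of extra restriction the discussion above anticipates.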
