Project 3 Final Report - Pair 12
Task T1
R1 = 10
R2 = -5
r = -5
Task T2
We evaluate policy P1 by examining the 19 possible starting points. For each starting point S, we focus on whether there is a path to the destination D and whether it is the shortest path.
Conclusion: all starting points had a shortest-distance path under policy P1. In total there were 37 paths that led from starting points to the destination, of which 29 were shortest paths, so about 78% of the solutions were optimal.
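One way to reproduce these counts is to check, for every starting point, whether following the policy reaches D and whether the resulting path length matches the true shortest distance. A minimal sketch of such a check is shown below; the grid encoding, the policy representation, and the helper names are illustrative assumptions, not our actual evaluation code.

```python
from collections import deque

# Hedged sketch: compare policy paths against BFS shortest paths on the grid.
# Assumptions (illustrative only): `free_cells` is a set of passable (row, col)
# cells including D, `policy` maps a cell to a move such as (-1, 0), and
# `dest` is the destination D.

def bfs_shortest_lengths(free_cells, dest):
    """Shortest number of moves from every reachable cell to the destination."""
    dist = {dest: 0}
    frontier = deque([dest])
    while frontier:
        r, c = frontier.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if nxt in free_cells and nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                frontier.append(nxt)
    return dist


def policy_path_length(start, policy, dest, limit=100):
    """Steps taken when following the policy from `start`, or None if it never arrives."""
    cell, steps = start, 0
    while cell != dest and steps < limit:
        if cell not in policy:
            return None
        dr, dc = policy[cell]
        cell = (cell[0] + dr, cell[1] + dc)
        steps += 1
    return steps if cell == dest else None
```

For each of the 19 starting points one would then compare the policy path length against the BFS distance and tally how many of the resulting paths are optimal.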
Task T3
1. Policy P2
R1 = 10
R2 = -50
r = -5
For T3 part 1, we ran value iteration with destination reward R1 = 10, hazard reward R2 = -50, and living reward r = -5. Policy P2 did show some changes relative to P1, but on closer observation all of the changes appeared near the borders. With the help of the following table we can see how effective P2 was and compare it with P1.
Policy P2 also has paths from all starting points, and a shortest path to the destination is still available from each one. However, because some actions changed direction, the number of paths from each starting point changed slightly.
Similar to policy P1, P2 had a total of 37 paths, of which 28 were optimal (shortest distance). This gives 76% optimal paths, slightly less than the 78% of P1 but almost the same.
Possible reason: this may have happened because of the increase in the negative hazard reward. Under the first policy the impact was limited to one block around the periphery of hazard H; now it propagates further into the grid.
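For reference, each of the policies in this task comes from rerunning value iteration with a different (R1, R2, r) setting. A minimal sketch of such a run on a deterministic grid is shown below; the grid representation, the discount factor gamma, and the function names are illustrative assumptions, not the project's actual implementation.

```python
# Hedged sketch: value iteration on a deterministic grid world.
# Assumptions (illustrative only): `cells` is the set of passable (row, col)
# states including the terminals, the destination and hazard are encoded as
# fixed terminal values R1 and R2, every non-terminal step earns the living
# reward r, and gamma is a discount factor not specified in the report.

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def value_iteration(cells, dest, hazard, R1, R2, r, gamma=0.9, iters=200):
    """Return the value function and greedy policy after `iters` sweeps."""
    V = {c: 0.0 for c in cells}
    V[dest], V[hazard] = float(R1), float(R2)     # terminal states keep fixed values

    def successor(s, move):
        nxt = (s[0] + move[0], s[1] + move[1])
        return nxt if nxt in V else s             # bumping into a wall stays in place

    for _ in range(iters):
        for s in cells:
            if s in (dest, hazard):
                continue                          # terminals are not updated
            V[s] = max(r + gamma * V[successor(s, m)] for m in MOVES.values())

    policy = {s: max(MOVES, key=lambda a: V[successor(s, MOVES[a])])
              for s in cells if s not in (dest, hazard)}
    return V, policy
```

Calling value_iteration with (R1, R2, r) = (10, -50, -5) would correspond to P2, and the other reward settings in this task follow the same pattern.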
2. Policy P3
R1 = 100
R2 = -500
r = -5
For T3 part 2, we ran value iteration with destination reward R1 = 100, hazard reward R2 = -500, and living reward r = -5. Policy P3 did show some changes, but on closer observation all of the changes appeared near the borders. With the help of the following table we can see how effective P3 was and compare it with P1 & P2.
Policy P3 also has a valid path from every starting point, but it fails to provide the shortest possible path for every one of them. So it is not as optimal as P1 & P2, although it still succeeds in providing a valid policy, with 84% of the starting points reaching the destination by a shortest path. In total it provides 33 paths from start to finish, of which 58% are shortest paths, significantly lower than the other two policies P1 & P2.
Possible reason: the difference in magnitude between the exit and hazard rewards (R1 & R2) and the living reward r. Since R1 and R2 were in the hundreds while the living reward was only -5, the agent can take its time finding a path that reaches the larger reward or avoids the major hazard.
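For example, ignoring discounting, a detour of two extra steps under r = -5 costs only 2 × 5 = 10 in living reward, which is negligible next to the 100-point destination reward or the 500-point hazard penalty, so the agent is effectively free to take longer, safer routes.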
3. Policy P4
For each of the above tasks we obtained a policy that has a path to the destination from every possible starting point. However, we can rule out policy P3 compared with P1 & P2 in terms of optimality, since it failed to provide a shortest path from every starting point.
R1 = 10
R2 = -5
r = -1
With R1 = 10, R2 = -5, and living reward r = -1, we obtained a policy that has a shortest path from every starting point to the destination D. It gives a total of 35 paths, which is fewer than P1 & P2, but 27 of them were optimal (27/35, about 77%), almost the same fraction of optimal paths as P1 & P2.
T3: Part 3
R1 = 50
R2 = -50
r = -1
With R1 = 50, R2 = -50, and living reward r = -1 we got almost the same results: 36 paths, of which 80% were optimal. That is the best so far, but only by about 2 to 4 percentage points, and mostly because it has fewer paths in total from the starting to the final positions.
T3: Part 3
R1 = 100, R2 = -500, r = -1
Starting point (row, col) | Path exists | Shortest path length | Policy path length | Is shortest path | No. of paths | No. of shortest paths
(0,1) | Yes | 2 | 2 | Yes | 1 | 1
(0,2) | Yes | 1 | 1 | Yes | 2 | 1
(1,0) | Yes | 4 | 4 | Yes | 1 | 1
(1,1) | Yes | 3 | 3 | Yes | 2 | 2
(1,2) | Yes | 2 | 2 | Yes | 2 | 2
(1,3) | Yes | 1 | 1 | Yes | 1 | 1
(1,4) | Yes | 2 | 2 | Yes | 1 | 1
(2,0) | Yes | 5 | 5 | Yes | 2 | 2
(2,1) | Yes | 4 | 4 | Yes | 3 | 1
(2,3) | Yes | 2 | 2 | Yes | 3 | 1
(2,4) | Yes | 3 | 3 | Yes | 2 | 2
(3,0) | Yes | 6 | 6 | Yes | 1 | 1
(3,1) | Yes | 5 | 5 | Yes | 4 | 2
(3,2) | Yes | 4 | 6 | No | 3 | 0
(3,3) | Yes | 3 | 3 | Yes | 2 | 1
(3,4) | Yes | 4 | 4 | Yes | 1 | 1
(4,1) | Yes | 6 | 8 | No | 2 | 0
(4,2) | Yes | 5 | 5 | No | 2 | 0
(4,3) | Yes | 4 | 6 | No | 1 | 0
Fraction | 19/19 = 1 | | | 15/19 = 0.79 | 36 (total) | 20/36 = 0.56
Table 6: R1 = 100, R2 = -500, r = -1
This run was not strictly necessary, as we had already ruled out this combination of R1 and R2, but we can see that it was only able to find a shortest path for about 79% of the starting positions (15/19 in Table 6) and that only 56% of the total paths were optimal.
Living rewards of -5 and -1 gave almost the same results, but going purely by the numbers we would prefer -1. According to our observations, however, the values of R1 and R2 affected the resulting policies more than r did.
Task T4
Initialize:
Transition probabilities P(agent, s' | s, a) for each agent and each state-action pair (s, a)
Update the value function for state s for the current agent:
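A minimal sketch of what this per-agent update might look like is given below, assuming a standard discounted Bellman backup; the discount factor gamma, the reward callable R(s, a, s_next), and the dictionary layout of P are illustrative assumptions rather than the project's actual code.

```python
# Hedged sketch: one value-iteration sweep for a single agent.
# Assumptions (illustrative only): P[agent][(s, a)] is a dict {s_next: prob},
# R(s, a, s_next) is a reward callable, V is a dict over states, and gamma is
# a discount factor; none of these names are taken from the project code.

def value_iteration_sweep(agent, states, actions, P, R, V, gamma=0.9):
    """Return an updated value function for one agent after a full sweep."""
    V_new = dict(V)
    for s in states:
        q_values = []
        for a in actions:
            transitions = P[agent].get((s, a), {})
            # Expected discounted return of taking action a in state s
            q = sum(prob * (R(s, a, s_next) + gamma * V[s_next])
                    for s_next, prob in transitions.items())
            q_values.append(q)
        if q_values:
            V_new[s] = max(q_values)              # Bellman optimality backup
    return V_new
```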