Maze Solving AI
Artificial Intelligence
https://github.com/jochuchemon7/Maze_Project_AI801.git
TABLE OF CONTENTS
1 ABSTRACT
2 INTRODUCTION
3 MAZE ENVIRONMENT
  3.1 Research
  3.2 Implementation
4 DEPTH-FIRST SEARCH
  4.1 Research
  4.2 Implementation
5 BREADTH-FIRST SEARCH
  5.1 Research
  5.2 Implementation
6 INFORMED SEARCHES: A* AND MODIFIED DEPTH-FIRST
  6.1 Research
  6.2 Implementation
7 REINFORCEMENT LEARNING
  7.1 Research
  7.2 Implementation
8 DATA TABLE AND CONCLUSIONS
9 BIBLIOGRAPHY
-------------------------------------------------------------------------------------------------------------------------------
2: Introduction
The focus of this report is the ability of maze-solving agents to traverse simple two-
dimensional arrays using different methodologies whose efficiencies can be compared. In
modern computer science, this is typically known as “tree traversal,” a system in which an agent
expands (or visits) a set of selected nodes in a particular linear order.
To achieve this computation, a program
creates a stack or queue of potential nodes to visit
with the order changing depending on which search
method or algorithm is used. The nodes are then
called, typically by a recursive function. Figure 1
displays the manner in which search trees are
visualized through multiple levels, or depths, of
potential nodes. As node 1 (the parent node) is expanded, nodes 2 and 3 (child nodes) are
queued next as potential nodes to expand, a process that continues through the whole tree until
either all nodes are expanded or a desired node is found. This process is most easily represented
as a stacked list that forms the entire environment of potential solutions.
This tree-search methodology forms the basic structure for the maze problem’s Depth-
first Search, Breadth-first Search, A* Search and modified Depth-first Search with heuristic
implementations, each differing only in terms of how the order of nodes is selected for
expansion. The similar structure of each search method allows quantities to be compared, such as
the number of nodes expanded before a solution is found, calculated as the sum of nodes
explored on the path to the goal node.
Our implementation of Reinforcement Learning uses a separate maze environment with a
more restricted size. This size restriction is necessary to allow the program to run numerous
training iterations so that the agent can learn the path to the goal node. While Reinforcement
Learning is a more complicated process than the other search methods, it provides an interesting
point of comparison for maze solving strategies with different levels of complexity.
Regardless of the implementation method under study, the formulation of the maze
problem remains consistent (a minimal sketch follows the list below):
States: the single unit nodes - nx
Initial State: starting node generated in the top left (always node 1) - n1
Transition Model: the movement from one node to another
Goal State: ending node generated in the bottom right corner - ng
Action Cost: 1 unit (to remain consistent)
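To make the formulation concrete, the following minimal Python sketch captures these elements (the class and method names are our own and do not appear in the project code):

```python
# Minimal sketch of the maze problem formulation; names are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class MazeProblem:
    n_rows: int
    n_cols: int
    walls: frozenset  # set of (row, col) cells that are walls

    def initial_state(self):
        return (0, 0)  # n1: node generated in the top-left corner

    def goal_state(self):
        return (self.n_rows - 1, self.n_cols - 1)  # ng: bottom-right corner

    def actions(self, state):
        """Transition model: moves to adjacent in-bounds, non-wall cells."""
        r, c = state
        candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        return [s for s in candidates
                if 0 <= s[0] < self.n_rows and 0 <= s[1] < self.n_cols
                and s not in self.walls]

    def action_cost(self, state, next_state):
        return 1  # every move costs one unit, to remain consistent
```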
According to this formulation, any maze-solving agent’s goal is to move from state n1 to
state ng with the lowest action cost, that is, the lowest number of nodes nx expanded. As the complexity of the
maze environment is altered, a larger cost is likely to be incurred overall; however, each search
method will differ in terms of how the cost is affected by scale.
This report presents both relevant abstract research and practical implementation of each
search method in the maze environment. It concludes with a summary of data generated by the
program, whether these findings are consistent with relevant research, and what the data indicate
for maze problems and search algorithms in general.
3.2: Implementation
There are a number of key fields significant to the implementation of our maze
environment, each of which plays a role in most of the search algorithms (a sketch follows the list):
Maze: contains the randomly generated matrix of ‘w’ (wall) and ‘p’ (passage) cells that make up
the possible pathways for the agent. These are converted
into 1s and 0s so they can be tested by the agent as it
searches for new nodes to expand. It is necessary for
the environment to contain at least one possible
pathway at all times. Commonly, the agent will find
multiple possible pathways as it moves through the
maze, if they exist. It is the job of the search algorithms
to decide which paths take preference (or multiple
paths as in the case of BFS searches). Note that in our
implementations there is only one pathway through the maze from start to goal. Figure 2 shows
the terminal output of this maze data representation.
Visited: a list of all nodes in the maze and whether or not they have been visited by the agent.
Node: the current node that the agent “occupies.” This begins as the start node.
Target_node: the node that the agent must reach to exit the maze and find a solution
Order: contains the order in which nodes are expanded. This data is called for animation.
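A minimal sketch of how these fields might be grouped into a single class (a hypothetical structure for illustration; the project's actual code may organize them differently):

```python
# Hypothetical container for the key environment fields described above.
class MazeEnvironment:
    def __init__(self, grid):
        # grid: rows of 'w' (wall) and 'p' (passage) characters
        self.maze = [[0 if cell == 'w' else 1 for cell in row] for row in grid]
        self.visited = [[False] * len(row) for row in grid]
        self.node = (0, 0)                                    # current node; starts at the start node
        self.target_node = (len(grid) - 1, len(grid[0]) - 1)  # goal node in the bottom-right corner
        self.order = []                                       # expansion order, later used for animation

    def expand(self, node):
        """Mark a node as visited and record the order of expansion."""
        r, c = node
        self.visited[r][c] = True
        self.order.append(node)
```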
Our implementation of the maze
environment makes use of matplotlib.pyplot,
numpy, and matplotlib.animation packages to
create the visual representation of our maze, as
depicted in Figure 3. Every generation of this
environment will be random, meaning there
will always be a different output for pathing
possibilities for each search algorithm. This
allows for the collection of quantitative data
including average number of nodes expanded in
the Visited list. Additionally, the starting node
and Target_node remain the same in each
implementation so that the efficiency of pathing
is tested – having a Target_node randomly
generated next to the starting node would throw
off the data with major outliers. Overall, this
maze strategy provides a useful environment
for manipulation.
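The sketch below illustrates the kind of rendering and animation described above, assuming the maze has already been converted to a 0/1 numpy array and that order is the list of expanded nodes; the function name and plotting choices are ours, not the project's:

```python
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np


def animate_search(maze, order):
    """Animate the node expansion order over a 0/1 maze grid (1 = passage, 0 = wall)."""
    fig, ax = plt.subplots()
    ax.imshow(np.array(maze), cmap="gray")
    scat = ax.scatter([], [], c="red", s=10)

    def update(i):
        # order holds (row, col) nodes; scatter offsets expect (x, y) = (col, row)
        points = np.array([(c, r) for r, c in order[:i + 1]])
        scat.set_offsets(points)
        return (scat,)

    anim = animation.FuncAnimation(fig, update, frames=len(order), interval=50)
    plt.show()
    return anim
```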
4: Depth-first Search
4.1: Research
The depth-first search works by expanding the first node it finds, continuing down each
depth-layer until it reaches the depth farthest from the top, then backing up until all nodes are
expanded or a solution is found.
Depth-first searches can yield important
information about the structure of the entire graph
by revealing potential values for the lowest depths
or edges associated with lower depths (Cormen
2001, 603). Depth-first searches are likely to find goal nodes at lower depths more quickly than
other search methods. However, their drawback becomes apparent when the goal node happens
to be at a surface-level depth. DFS is the most likely of all the tested search algorithms to follow the
natural path of a human solving a maze, whereby the solver checks long pathways rather than
turning around each corner at surface levels.
4.2: Implementation
The Depth-first search is one of the simplest algorithms to implement due to its short,
recursive or iterative structure. In our maze environment, we created a DFS algorithm that
iteratively checks neighboring nodes, moving to the node at the greater depth and repeating the
process until the goal node is found. In the maze environment, it is necessary to treat each
potential movement (up, right, left, and down) as a potential node, then sift through those nodes
depending on whether they have been visited, whether they are a wall, and whether they would
fall outside of the maze environment. In DFS, as soon as the first node is validated, it is expanded
along with its first valid child node, and so on down the branch. Figure 6 displays this structure;
a simplified sketch follows.
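A compact sketch of this kind of depth-first expansion, written with an explicit stack rather than recursion (names are ours, and the project's code may differ in detail):

```python
def dfs(maze, start, goal):
    """Depth-first search over a 0/1 grid (1 = passage, 0 = wall).

    Returns the order in which nodes were expanded, or None if no path exists.
    """
    n_rows, n_cols = len(maze), len(maze[0])
    visited = set()
    stack = [start]
    order = []

    while stack:
        node = stack.pop()  # LIFO: always dive deeper before backing up
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        if node == goal:
            return order

        r, c = node
        for nr, nc in ((r - 1, c), (r, c + 1), (r + 1, c), (r, c - 1)):  # up, right, down, left
            if 0 <= nr < n_rows and 0 <= nc < n_cols \
                    and maze[nr][nc] == 1 and (nr, nc) not in visited:
                stack.append((nr, nc))
    return None
```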
5: Breadth-first Search
5.1: Research
While the breadth-first search is known as the simplest search method, it is not likely to
solve the maze problem as efficiently as the depth-first search method. This is because all
potential nodes per depth must be expanded before moving to lower depths, a process fulfilled by
simple while and for loops (Cormen 2001, 594). In the worst-case scenario, this process will
incur an O(b^d) runtime and memory cost, where b is the branching factor and d is the depth of
the goal (Korf 1985, 99). Figure 8 shows the pathing of this
algorithm.
In the context of a complex maze environment, we are more concerned with memory and
time usage than node expansion cost. One strength of the BFS is its capability to find the cost-optimal path by
expanding all nodes at each depth of the search tree on the path to the goal state (Russell 2021,
3.4.1). A maze does not necessarily induce a cost for making a turn in the same way that, for
instance, a unique cost may be applied to a car finding directions to and from a city.
Theoretically, the cost of each move will be equal, meaning the most impactful qualities of our
maze will be the memory and time consumption of the search method itself, especially as the
maze complexity increases.
An additional consideration associated with BFS is the starting point of the search. The
traditional top-down BFS is restricted by many of the costs mentioned above. However, a
bottom-up search has the potential to reduce unnecessary node-searches in instances where top
depths are very broad (or numerous in potential node expansions) since it does not require
expanding all neighbors when a parent node is found (Beamer 2013, 3). This is likely to produce
high variation in efficiency across individual BFS runs.
5.2: Implementation
Korf draws attention to the poor scalability of the BFS method: a modern
computer generating and storing a million states per minute could quickly exhaust its storage
capability in a matter of minutes (Korf 1985, 99). As a result, BFS may be best implemented in
small and less complex maze problems. Our implementation reflects these limitations.
In Figure 9, one can see that the relative complexity of Breadth-first searches in maze
environments is greater than that of Depth-first searches. This is because as nodes are expanded
in the environment, a situation may occur where there are various possible valid paths (left, right,
up, or down) that expand into another set of various paths – a list containing a list. Therefore, to
queue nodes in the proper order for BFS, multiple lists must be checked – here node_list,
new_node_list, and each node itself (since our maze represents nodes as two-integer lists).
Figure 10 shows the pattern this creates.
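The following sketch mirrors the level-by-level queuing described above, using node_list and new_node_list as in the report; it is a simplified illustration rather than the project's exact code:

```python
def bfs(maze, start, goal):
    """Breadth-first search over a 0/1 grid, expanding one depth level at a time."""
    n_rows, n_cols = len(maze), len(maze[0])
    visited = {start}
    node_list = [start]        # all nodes at the current depth
    order = []

    while node_list:
        new_node_list = []     # nodes discovered for the next depth
        for node in node_list:
            order.append(node)
            if node == goal:
                return order
            r, c = node
            for nr, nc in ((r - 1, c), (r, c + 1), (r + 1, c), (r, c - 1)):
                if 0 <= nr < n_rows and 0 <= nc < n_cols \
                        and maze[nr][nc] == 1 and (nr, nc) not in visited:
                    visited.add((nr, nc))
                    new_node_list.append((nr, nc))
        node_list = new_node_list
    return None
```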
In maze environments, this search method can clearly be seen to expand almost all nodes
(or possible pathways) before reaching the goal node. As a result, its runtime efficiency will be
drastically lower than that of any other search method. As the complexity of the maze
environment increases, BFS will be unable to scale, and efficiency will plummet. The only
advantage of Breadth-first search is that it expands surface nodes quickly, meaning that if the
solution happens to be near the surface, the agent will not skip over it, as is likely to occur in the
case of Depth-first search. Figure 10 represents the worst-case scenario in which the goal node
(green dot in the bottom right corner) is opposite the start node, requiring a full expansion of the
maze. For quantitative analyses of BFS, see Conclusions.
6.2: Implementation
The most important aspect of implementing A* is defining the cost function used as the agent
traverses nodes, as seen in Figure 12. In our implementation, the path cost for each move from
one node to the next is equal to one, and Manhattan distance is applied as the heuristic.
The impact of information in the form of an appropriate heuristic is clearly visible in our
implementation of the A* search algorithm. The algorithm uses a priority queue data structure,
adding and storing unvisited nodes adjacent to previously visited nodes in the queue. A selection
process is applied for each move to a new node: the algorithm evaluates the evaluation function
f(n) for all nodes on the frontier and systematically moves to check the node with the lowest
associated f(n) value. The search tends to follow the current branch toward the goal node for
long periods, until the path either encounters the goal node, turns away from it, or dead-ends.
Shifts up the search tree, even close to the starting node, occur as the search reorients to find the
likely optimal path. In many cases, however, node exploration higher in the tree (closer to the
starting node) is limited, and the search returns to depth.
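A minimal A* sketch consistent with the description above: unit step costs, a Manhattan-distance heuristic, and a priority queue ordered by f(n) = g(n) + h(n). The function names are ours, and the project's implementation may differ in detail:

```python
import heapq


def manhattan(node, goal):
    return abs(node[0] - goal[0]) + abs(node[1] - goal[1])


def astar(maze, start, goal):
    """A* search over a 0/1 grid with unit step costs and a Manhattan heuristic."""
    n_rows, n_cols = len(maze), len(maze[0])
    g = {start: 0}                                 # cost so far from the start node
    frontier = [(manhattan(start, goal), start)]   # priority queue of (f(n), node)
    visited = set()
    order = []

    while frontier:
        f, node = heapq.heappop(frontier)          # node with the lowest f(n)
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        if node == goal:
            return order

        r, c = node
        for nr, nc in ((r - 1, c), (r, c + 1), (r + 1, c), (r, c - 1)):
            if 0 <= nr < n_rows and 0 <= nc < n_cols and maze[nr][nc] == 1:
                new_g = g[node] + 1                # every move costs one unit
                if new_g < g.get((nr, nc), float("inf")):
                    g[(nr, nc)] = new_g
                    heapq.heappush(frontier, (new_g + manhattan((nr, nc), goal), (nr, nc)))
    return None
```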
7: Reinforcement Learning
7.1: Research
Reinforcement Learning is a broad topic which, in essence, tries to solve traditional
dynamic programming problems by using a state-action function and finding the right policy
instead of trying to split a big problem into subproblems and perform traditional recursion on
them. For this to occur, the general framework requires an agent that takes an action from a list
of possible actions inside an environment, with the choice of action determined by the state that
the agent receives from the environment.
The agent’s action causes changes to the environment in which the agent is operating. In
return, the environment grants a reward to the agent for its action, as well as a new state for the
agent to evaluate in determining its next action. The agent will now be in the new state that the
environment returns, and we can form observations with the current state, action, reward and
next state. Depending on the algorithm that one uses, the agent will use the state and reward
differently. In our maze project we decided to use a simple DQLN, or Deep Q-Learning
Network. This algorithm uses a neural network as its Q-Function: the network takes in the state
that the environment passes to it and calculates what are called Q-values, and the reward is used
to evaluate and update the network. The number of Q-values is the same as the number of
actions that the agent can take and, depending on the policy, the implementation picks a Q-value,
which is paired with the action that the agent will take. We also have an optimizer for our NN to
perform gradient descent and improve the model.
The code includes the option to tune the number of training episodes for the simulation as
well as the gamma value (discount factor) which weights the relative impact of future state-
action rewards and those accumulated by the agent at the beginning of the episode. The higher
the discount factor the more that future observations will affect updates to the Q-Function since
reward is multiplied by the discount factor. The policy will be the one that eventually picks the
Q-value and its corresponding action. One example is the Epsilon-Greedy policy, in which we
set the variable ‘epsilon’ to a value between 0 and 1 and constantly decrease the value after each
episode, usually by a factor of one divided by the number of episodes. Then with a probability of
the epsilon value we pick a random q-value and with the complementary probability we pick the
maximum q-value.
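A short sketch of the epsilon-greedy selection and decay just described (illustrative only; the exact schedule used in the project may differ):

```python
import random


def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy (max Q-value) action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Decay schedule: start at 1.0 and reduce by 1/num_episodes after each episode.
num_episodes = 500
epsilon = 1.0
for episode in range(num_episodes):
    # ... run one episode, calling epsilon_greedy(q_values, epsilon) at each step ...
    epsilon = max(0.0, epsilon - 1.0 / num_episodes)
```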
7.2: Implementation
As discussed above, we decided to use a simple DQLN approach (Deep Q-Learning
Network model) for our maze-solving problem. We first needed to make some modifications to
our maze environment to accommodate the model. One of the most significant changes that we
made to the initial maze was to properly include an agent in our environment. For this step, we
created a new agent class that includes the position and name of the agent as well as a list of cells
visited, whether an agent’s move was valid or invalid, and an invalid movement counter. We also
created a goal class with just the name and position of the goal in the maze. These two classes
were created through a function that added the class objects in a key-value dictionary for the
agent and goal. We also included a step function that takes in the agent’s pending action,
evaluates the action through a check step function and, if the step is determined to be valid,
marks the action as valid and changes the agent’s position. If the step is determined to be invalid,
the step function marks it as invalid, includes the action in a list of invalid cells, and adds one to
the counter. The invalid list and counter are reset when an action is valid.
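A rough sketch of the agent and goal classes and the step logic described above (hypothetical names and structure; the project's code may differ):

```python
class Agent:
    def __init__(self, name, position):
        self.name = name
        self.position = position          # (row, col)
        self.visited_cells = [position]   # cells visited so far
        self.valid_move = True
        self.invalid_cells = []           # invalid actions attempted since the last valid move
        self.invalid_count = 0


class Goal:
    def __init__(self, name, position):
        self.name = name
        self.position = position


def check_step(maze, position, action):
    """Return the new position if the move lands on an in-bounds passage cell, else None."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[action]
    r, c = position[0] + dr, position[1] + dc
    if 0 <= r < len(maze) and 0 <= c < len(maze[0]) and maze[r][c] == 1:
        return (r, c)
    return None


def step(maze, agent, action):
    """Apply an action: mark it valid or invalid and update the agent accordingly."""
    new_position = check_step(maze, agent.position, action)
    if new_position is None:
        agent.valid_move = False
        agent.invalid_cells.append(action)
        agent.invalid_count += 1
    else:
        agent.valid_move = True
        agent.invalid_cells = []
        agent.invalid_count = 0
        agent.position = new_position
        agent.visited_cells.append(new_position)
```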
We also included a reward function that will return a value of –0.7 if the agent’s position
ends on the start node or if it hits a wall, a reward of –0.25 if the new position is a valid one but
already visited, a reward of –0.04 if the new position is valid and not visited, and a reward of 1 if
the agent reaches the goal state. The rewards are not greater than 1 or less than –1 because we are
using a Neural Network for our Q-Network, which works best when dealing with values that are
normalized or less than 1. Finally, we modified a self.get_maze_() function so that instead of
returning the [nrow x ncol] matrix we return a [nrow x ncol x 3] matrix. This new matrix
functions as the ‘state’ that the environment will pass to our model, essentially three mazes
representing the agent, goal, and walls respectively as ones and the rest of each grid as zeros.
That way, the model perceives the constant movement of the agent and the goal location.
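The reward scheme and the [nrow x ncol x 3] state encoding described above might look like the following (variable names and function signatures are ours):

```python
import numpy as np


def reward(agent, goal, start, hit_wall):
    """Reward scheme described above; all values stay inside [-1, 1]."""
    if agent.position == goal.position:
        return 1.0       # reached the goal state
    if hit_wall or agent.position == start:
        return -0.7      # hit a wall or ended back on the start node
    if agent.position in agent.visited_cells[:-1]:
        return -0.25     # valid move, but onto an already-visited cell
    return -0.04         # valid move onto a new cell


def get_state(maze, agent, goal):
    """Build the [nrow x ncol x 3] state: agent, goal, and wall layers marked with ones."""
    n_rows, n_cols = len(maze), len(maze[0])
    state = np.zeros((n_rows, n_cols, 3), dtype=np.float32)
    state[agent.position[0], agent.position[1], 0] = 1.0   # agent layer
    state[goal.position[0], goal.position[1], 1] = 1.0     # goal layer
    state[:, :, 2] = (np.array(maze) == 0)                  # wall layer (0 = wall in the 0/1 grid)
    return state
```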
That entire [nrow x ncol x 3] grid is flattened into a [1, nrow x ncol x 3] matrix, which is passed
to the neural network to generate the q-values. The epsilon-greedy policy uses the q-values to
pick an action and gather new observations. We set our epsilon value to 1 and decreased it by
one over the number of episodes so that our agent gradually shifts toward picking the best action
for the state it is passed. We kept a lower threshold of 0.1 so that there is always a small
probability of picking a random action, in case a better one might be found later.
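As an illustration only, a Q-network of the kind described might look like the following in PyTorch; the report does not specify the framework, layer sizes, or learning rate, so all of those choices are assumptions:

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Maps the flattened [1, nrow*ncol*3] state to one Q-value per action."""

    def __init__(self, n_rows, n_cols, n_actions=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_rows * n_cols * 3, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q-value per action (up, down, left, right)
        )

    def forward(self, state):
        return self.net(state)


# Optimizer used for gradient descent on the Q-network (learning rate is an assumption).
model = QNetwork(n_rows=11, n_cols=11)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```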
We also set a value for the maximum number of steps that the agent can take before the
episode is deemed failed or lost. This encourages the agent to find a solution quickly and to
avoid getting stuck in the maze or, worse, settling on a solution that visits more cells than
necessary.
Overall, we found that the DQLN algorithm was successful at finding the best path to the
goal in small mazes, generally anything smaller than 11x11. We considered incorporating
experience replay and a target network, but we ran out of time. We also considered other
algorithms such as PPO and Actor-Critic, but we judged that a simple DQN would suffice. A
change we think could have made a larger difference is the input to the model itself: instead of
passing a [nrow x ncol x 3] matrix, we could have passed the maze in its color-coded values and
run some convolutional layers before the linear and nonlinear layers, which might have
converged better on bigger mazes.
8: Conclusions
The following data tables contain information for our four search algorithms: Breadth-
first search, Depth-first search, A* Search, and Depth-first search with heuristics. The data
represent the number of nodes expanded before the goal node is found, measured as the length of
the “order” list in each program. Each algorithm made use of a 50 x 50 randomized maze so that
the number of nodes expanded would be comparable. We ran 30 tests of each algorithm to find
the average number of nodes expanded. The data tables and graphs in Figures 16 through 18
summarize our findings. Efficiency in terms of the maze problem is defined as the number of
unnecessary nodes expanded before a goal node is found. BFS is the most inefficient,
followed by DFS, then DFS with a heuristic, and finally A*. Of particular interest in these data is
the tendency for BFS, A*, and DFS with a heuristic to produce both high and low outliers,
mainly because of the pathing some randomized mazes would produce. In the case of BFS, we
see efficient outliers when there were few outbranching paths from the start node to the goal
node. In this situation, DFS would perform worse than BFS because it would be likely to follow
a path that cannot rejoin the goal-node path.
Because DFS with a heuristic is similar to A* in terms of the mean number of nodes
expanded, we performed a two-sample t-test for a difference in means to see whether their
difference in performance could be considered significant. At the 5% significance level, the
p-value is greater than .05, meaning we cannot reject the null hypothesis that the two population
means are equal. For this reason, while A* appears slightly more efficient according to the data,
we can consider the two algorithms statistically equivalent.
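The two-sample test can be reproduced with scipy; the report does not state whether variances were pooled, so the sketch below uses Welch's variant (equal_var=False), and the node-count lists are whatever was recorded over the 30 runs:

```python
from scipy import stats


def compare_mean_nodes_expanded(astar_counts, dfs_heuristic_counts, alpha=0.05):
    """Two-sample t-test (Welch's variant) for a difference in mean nodes expanded."""
    t_stat, p_value = stats.ttest_ind(astar_counts, dfs_heuristic_counts, equal_var=False)
    # Returns the test statistic, the p-value, and whether the difference is significant at alpha.
    return t_stat, p_value, p_value < alpha
```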
The greatest limitation we have seen on this project is the inability of Reinforcement
Learning to handle larger maze environments. Around the 10 x 10 size, the speed of RL drops
dramatically, whereas the other search algorithms, especially A* and DFS, scale nicely.
Another limitation to consider is that our data do not apply to all maze environments, but
rather pertain only to mazes generated with Prim's algorithm. In the case of other maze
environments, it is possible to run into other issues, such as pathing that includes infinite loops.
Likewise, because of the scope of the project, we were not able to include other forms of
complexity such as more than one agent. Regardless, the data we have obtained is useful in
testing how each algorithm responds to Prim’s algorithm mazes.
A maze environment constructed with Randomized Prim’s algorithm provides a useful
tool for testing the efficiencies of search algorithms and reinforcement learning implementations.
Some of the peculiar characteristics of the maze come through in our data, and we see that the
randomized pathing plays a large role in determining which implementation is most efficient.
Although reinforcement learning may be used to find the most efficient pathway to the goal
node, it is not necessarily optimal for solving the problem because of the time complexity of the
learning process and the many hyperparameters that must be tuned. In fact, comparing RL to our
search algorithms shows that “simple is better” in some cases. If we evaluate solely the most
practical approach to our implementation of the maze problem, A* search stands out as the
best-performing algorithm of those we tested.
9: Bibliography
Algfoor, Zeyad Abd, M. Sunar and H. Kolivand. 2015. “A Comprehensive Study on Pathfinding
Techniques for Robotics and Video Games.” International Journal of Computer Games
Technology. Article ID: 736138. http://dx.doi.org/10.1155/2015/736138
Beamer, Scott, Krste Asanović and David Patterson. 2013. “Direction-optimizing breadth-first
search." Scientific Programming 21: 137-148.
“Binary Tree Level Order Traversal.” n.d. AlgoMonster. Accessed March 11, 2023.
https://algo.monster/problems/binary_tree_level_order_traversal.
Biswas, Souham. 2020. “Maze solver using Naïve Reinforcement Learning.” Towards Data
Science. https://towardsdatascience.com/maze-rl-d035f9ccdc63
Chrestien, Leah, Tomáš Pevný, Antonín Komenda, and Stefan Edelkamp. 2022. “A Differentiable
Loss Function for Learning Heuristics in A*.” arXiv preprint arXiv:2209.05206.
Chrestien, Leah, Tomáš Pevný, Antonín Komenda, and Stefan Edelkamp. 2021. “Heuristic Search
Planning with Deep Neural Networks Using Imitation, Attention and Curriculum Learning.”
arXiv preprint arXiv:2112.01918.
Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2001.
Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill. ISBN 0-262-
03293-7.
Dechter, R. and Pearl, J. 1985. “Generalized best-first search strategies and the optimality of
A*.” JACM, 32, no. 3: 505–536.
Even, Shimon. 2011. Graph Algorithms, Second Edition. Cambridge University Press. ISBN
978-0-521-73653-4.
Foead, Daniel, Alifio Ghifari, Marchel Budi Kusuma, Novita Hanafiah, and Eric Gunawan. 2021.
“A Systematic Literature Review of A* Pathfinding.” Procedia Computer Science 179:
507-514.
Korf, Richard. 1985. “Depth-First Iterative-Deepening: An Optimal Admissible Tree Search.”
Artificial Intelligence 27: 97-109.
https://academiccommons.columbia.edu/doi/10.7916/D8HQ46X1
Mnih, V., Kavukcuoglu, K., Silver, D., et al. 2015. “Human-level control through deep
reinforcement learning.” Nature 518: 529–533. https://doi.org/10.1038/nature14236
Moerland, Thomas M., Joost Broekens, Aske Plaat, Catholijn M. Jonker. 2022. Model-Based
Reinforcement Learning: A Survey (v4). arXiv:2006.16712v4.
Poole, David and Alan Mackworth. 2017. “3.5.3 Iterative Deepening” in Artificial Intelligence:
Foundations of Computational Agents, 2nd Edition. Cambridge University Press.
Prim, R. C. 1957. “Shortest Connection Networks And Some Generalizations.” Bell System
Technical Journal 36: 6.
Russell, Stuart and Peter Norvig. 2021. Artificial Intelligence: A Modern Approach. Pearson
Series in Artificial Intelligence, 4th Edition. NJ: Prentice Hall.
Sutton, Richard S. and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT
Press. ISBN 0262039249.
Zai, Alexander and Brandon Brown. 2020. Deep Reinforcement Learning in Action. Manning
Publications. ISBN 9781617295430.