
A deep learning approach to the Geometry Friends game

Daniel Brandão, Francisco Melo, João Dias
Instituto Superior Técnico, Lisbon, Portugal

ABSTRACT
This work focuses on creating single-player artificial intelligent agents for the game Geometry Friends. Four different solutions were crafted based on deep neural networks, using both supervised and reinforcement learning approaches. The main purpose of these solutions is to create an agent that is able to select the best action given a preprocessed snapshot of the current state of the game, learning to control the ball character and interact with the several obstacles in Geometry Friends. The different agents are then compared with each other and with other single-player agents submitted to previous Artificial Intelligence competitions, in order to study and understand why each crafted solution may or may not be a good solution for Geometry Friends, covering their advantages, disadvantages and limitations.

Author Keywords
Artificial Intelligence; Geometry Friends; Deep Learning; Supervised Learning; Reinforcement Learning; Convolutional Neural Network

ACM Classification Keywords
I.2.1. Learning: Applications and Expert Systems

INTRODUCTION
Geometry Friends is a 2D cooperative platform puzzle game. The main objective of the game is to collect all the diamonds present in the level in the least amount of time. The players control either a yellow circle or a green rectangle character, each one with a different set of actions.

Geometry Friends also has a physics-based game environment, with features such as acceleration and friction. This dynamic environment requires a certain amount of skill and coordination from both players in order to successfully complete a level. This creates a complex challenge for the players, making it an interesting case study for Artificial Intelligence (AI).

For these reasons, creating an agent capable of playing many levels of this game successfully is a very difficult and interesting task, and that is why the core of this work is developing, studying and exploring different deep learning approaches to Geometry Friends.

GEOMETRY FRIENDS
Geometry Friends is a platform level-based game. Levels can be created by manipulating the different obstacles and the characters' initial positions using the Geometry Friends level editor tool.

The obstacles and platforms in the game can obstruct only one or both players. The black zones block the path of both the circle and rectangle characters. The green zones block the path of the yellow circle. The yellow zones block the path of the green rectangle. Some platforms may also have some predetermined motion. These features add extra complexity to the cooperation game. In Figure 1 we can see an example of a Geometry Friends level. The objective of the game is to collect all the purple diamond-shaped tokens that exist across the level.

Figure 1. Example of a Geometry Friends level (extracted from http://gaips.inesc-id.pt/geometryfriends/?page_id=26).

Both the yellow circle and the green rectangle characters have a set of three actions, as depicted in Figure 2. The circle character can roll left or right and jump. The circle character levels are more associated with control challenges, since its speed influences its jumping technique. The rectangle character can slide left and right and change its shape to a horizontal or vertical rectangle, maintaining the same area. The track for this character is associated with more puzzle-solving levels.

Figure 2. Geometry Friends characters' actions (extracted from http://gaips.inesc-id.pt/geometryfriends/?page_id=26).

RELATED WORK

Artificial Intelligence in Geometry Friends

RRT Agent
Rui Soares and Francisco Leal submitted a Rapidly-exploring Random Tree (RRT) approach for the individual levels of Geometry Friends, for both the rectangle and circle characters [4].

Solving a level is divided into two stages: planning and control. The planning phase consists, as in previous approaches, of identifying important points in the level, like ends of platforms and jumping points (example illustrated in Figure 3). This is done using the RRT search, which returns a set of nodes ordered and labeled according to the information needed to go through that section, such as: turning point, fall, diamond above, gap, jump and morph. The control phase of the solution takes the information from the planning phase and is responsible for adapting the agent's velocity and performing the special moves of each character. A proportional-integral-derivative controller (PID controller) is used to control the velocity in real time.

Figure 3. Example of the nodes marked in the RRT solution (extracted from [4]). The different nodes can be distinguished between: turning points (TP), falls (F), morphs (M), diamond above (DA), gaps (G) and jumps (J).

Reinforcement Learning Agent
João Quitério created an agent for the circle character [9]. The approach is based on a divide-and-conquer strategy. The agent divides the problem into three sub-problems: solving one platform (SP1), deciding the next platform to solve (SP2) and moving from one platform to another (SP3). A world model of the level is created, storing all platforms of the level, the diamonds assigned to them, a navigation map to decide where to go next, and the character's velocity and size. The agent also has a knowledge base to select the best action to perform given a certain state of the game. SP2 is solved using a depth-first search (DFS) to find the character's path through the level platforms. In SP1 and SP3, reinforcement learning is used. To represent the state of the game, a set of features is captured. The agent's decision flow starts with trying to solve the current platform, trying to collect all diamonds assigned to it. After a platform is solved, the agent decides which platform to solve next, proceeding to move toward it.

Figure 4. Example of platform division in the RL solution (extracted from [9]). This picture illustrates the divide-and-conquer strategy used in this solution, where the level is divided into four subproblems: P1, P2, P3 and P4.

Discussion
The RRT agent was able to complete path-planning-based levels but failed in some where high control precision was necessary, possibly needing more work on controlling the characters during jumps and falls. The reinforcement learning approach had control problems during jumps, possibly needing more training. A diamond was assigned to the platform directly below it, and the solution was designed to solve one platform at a time, so floating diamonds that needed to be collected while jumping or falling were not collected. Another problem during reinforcement learning was the incomplete state representation, where two states that are quite different and require a different set of actions are identified as the same through the features.

Deep Learning

In Modern Gaming AI
A successful example of deep learning is the construction of deep learning models for modern gaming [1]. In this work, three different deep neural network architectures were constructed and supervised learning was used.

The first was a simple CNN that receives as input the current frame of the game. The second (early integration) receives as input four consecutive frames. The third (late integration) uses four frames as input, each one with its own convolutional layers, merging at the end (Figure 5). These three models were constructed in order to study the most relevant architecture to learn to play modern games such as Super Smash Bros. The third architecture topped the other two by having a higher validation accuracy after 1 epoch of 15,000 batches (each with 25 frames) using supervised learning. Integration of frames is important to extract dynamic time-based features such as velocity.
Figure 5. CNN architecture with late integration (extracted from [1]). Each colored row is the CNN for one frame. The first hidden layer convolves 96 filters of size 7 x 7 with stride 2, followed by rectifier units, normalization units and a pooling layer with 3 x 3 filters and stride 3. It then convolves again, creating 256 filters of size 5 x 5 with stride 1, followed by rectifier units and a pooling layer with 2 x 2 filters and stride 2. It convolves again 3 more times with 512 filters of size 3 x 3 and stride 1, each followed by rectifier units. A final pooling is made with 3 x 3 filters and stride 3. The network terminates by connecting all four CNNs to three fully connected layers, the first two with 4096 units each and the third with 30 units that output over all possible actions.

Although it was able to learn character control and to track the opponent's position, some weaknesses were observed. One of them was a locality problem: the model had only a local view of the map but not a global one, leading to the character prioritizing chasing the opponent even if it meant falling into a pitfall. This could be solved by adding to the input a set of features that would help the network learn to position the character globally. Another drawback was data limitation. Adding different play-styles from many different players would help make the model more robust. Because supervised learning is used, the agent can only perform as well as the players who provided the data, resulting in an upper learning limit. This could be solved by using reinforcement learning techniques to strengthen the model.

To Play Atari Games
To play Atari games, a deep Q-learning algorithm was proposed [6]. It uses only the state as input, and the output is the set of estimated Q-values, one for each action. This allows calculating the estimated Q-values of all actions in a certain state in a single forward step.

Each frame is preprocessed, creating an 84 x 84 sized tensor. To capture important dynamic variables that are present in Atari games, like velocity, the input for the network is a set of 4 sequential frames. The network architecture is illustrated in Figure 6. A total of 50 million frames were used for training from 49 different Atari games.

Experience replay is used, randomly sampling minibatches in the inner loop to obtain data efficiency and smooth the learning process.

The agent performed at human level in 29 out of 49 Atari games, surpassing human experts in three of them.

Figure 6. CNN architecture used to play Atari games (extracted from [6]). The input has size 84 x 84 x 4. The first hidden layer convolves 32 filters of 8 x 8 with stride 4, applying a rectifier nonlinearity (blue symbol after the first hidden layer). The second hidden layer convolves 64 filters of 4 x 4 with stride 2 and the third convolves 64 filters of 3 x 3 with stride 1, both followed by a rectifier. The final hidden layer is a fully-connected layer of 512 rectifier units. The output layer is also a fully-connected layer with a single output for each valid action.
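As a concrete illustration of the experience replay mechanism described above, the following is a minimal sketch of a uniform-sampling replay buffer; the capacity, batch size and transition layout are illustrative assumptions rather than the settings used in [6].

import random
from collections import deque

# Minimal uniform-sampling experience replay (illustrative sketch).
class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded first

    def add(self, state, action, reward, next_state, terminal):
        # Every interaction with the game is stored as one transition.
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size=32):
        # Uniform random minibatches break the correlation between
        # consecutive frames and let each transition be reused many times.
        return random.sample(self.buffer, batch_size)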
AlphaGo
Another successful example of using deep reinforcement learning techniques is the AlphaGo agent [10]. The agent's deep learning development was divided into three main stages.

The first stage consisted of using supervised learning to learn directly from expert human moves. A thirteen-layer CNN was trained with 30 million positions from the KGS Go Server.

The second stage consisted of improving the CNN obtained in the previous stage by using policy gradient reinforcement learning. Policy gradient methods work with probabilities of action selection instead of working with Q-values. The training was done against a randomized pool of different opponents to prevent over-fitting and stabilize training.

The final stage of training consisted of constructing a similar CNN that would have a single output instead of a probability distribution. This output would be the estimated value function of each state. The network was trained by generating a new self-play data set of 30 million positions from separate games played by the RL agent against itself.

Figure 7. CNN training pipeline used in AlphaGo (extracted from [10]).

Another small network was created to obtain a rollout policy for the Monte Carlo Tree Search (MCTS). This small network received only global features as input and was faster at selecting an action, although much less accurate. The best action selection mechanism was a 50% mixed evaluation of the state, using the results from both the value network and the MCTS rollouts.

In 2016 the agent was able to defeat the world Go champion Lee Sedol in four out of five matches.

Learning from Demonstrations
Many reinforcement learning solutions require a lot of data until they reach reasonable performance. The Deep Q-learning from Demonstrations (DQfD) solution [3] proposes using a small quantity of demonstration data in order to accelerate the learning process. This method combines both supervised learning and reinforcement learning, so that the agent is able to learn how to imitate the demonstrator while also being aware of which actions will result in better rewards.

During pre-training, the agent learns using only human game play data. After pre-training, the agent uses a mix of demonstration and self-generated data.

Because self-generated data is not able to cover all possible state-action pairs, using purely Q-learning would result in calculating Q-values that are based on ungrounded variables, and these would then propagate throughout the network. To solve this, a large margin classification loss [8] was added. This loss forces the values of the other actions to be at least a margin lower than the value of the demonstrator's action. With this loss, the policy obtained is more likely to imitate the demonstrator. Along with this loss, a double Q-learning loss [2], an n-step Q-learning loss [7] and an L2 regularization loss are also applied to the network's parameters. The total loss used is in Equation 2, where the λ parameters can be adjusted. All losses are applied to the demonstration data. The supervised loss is not applied to self-play data (λ2 = 0).

J_E(Q) = max_a [Q(s, a) + l(a_E, a)] − Q(s, a_E)    (1)

J(Q) = J_DQ(Q) + λ_1 J_n(Q) + λ_2 J_E(Q) + λ_3 J_{L2}(Q)    (2)

The demonstration data resides in the replay buffer permanently. The ratio of demonstration and self-generated data present in, and sampled from, the replay buffer is important for the algorithm's performance.

It would take about 82 million steps for DQN to match DQfD's performance after 1 million steps. DQfD's use of demonstration data also allowed it to achieve state-of-the-art results in 17 of the Atari games. Many of these games, like Montezuma's Revenge, are hard to explore, proving that exploration based on human data can significantly improve an agent's performance.

Figure 8. Comparison of DQN and DQfD in the Montezuma's Revenge game (extracted from [3]). As can be seen, DQfD outperforms both DQN and the imitation algorithm by far. The imitation algorithm corresponds to an agent trained using solely the demonstration data and supervised learning.
AGENT INTEGRATION

Main Components
Geometry Friends has an AI component that allows users to integrate their own agents with the game. To do this, one of the requirements is that the agent must be built using the C# programming language. The tool chosen for deep learning was the Microsoft Cognitive Toolkit [5]. This tool was one of the few that allowed network evaluation in C#, while the training had to be done in either Python or BrainScript.

The agent was partitioned into three segments:

• Capturing - responsible for capturing, saving and mapping the current state and action into files.
• Training - responsible for training a network using the different mapping files previously crafted.
• Acting - responsible for evaluating a network, obtaining its output and translating it into actions. This can be considered the main component of the agent, since it is the only segment used after training is finished.

The acting component needs the network from the training component, and it is often merged with the capturing component in order to produce data for future training. Figure 9 illustrates a diagram that shows how these components interact.

Figure 9. Diagram of the agent's components and their interactions.

Capturing

Game State
The representation chosen for the game state was a set of four consecutive frames. These frames were captured at a regular interval inside the capturing component using the game's API.

The frames are then resized to a 150x90 dimension, preserving the original frame proportion. This size was chosen based on the Geometry Friends level editor tool, which segments a normal level into 75x45 squares. If the resize was done to a 75x45 dimension, the resulting state would lose too much information. The comparison between both sizes can be visualized in Figure 10.

Figure 10. Comparison between resize dimensions. On the left side is the 75x45 resizing and on the right side is the 150x90 resizing.

The graph in Figure 11 compares the accuracy of the agent when using supervised learning with human game play data. The blue line shows its accuracy across the different epochs when using 150x90 frames as input and the red line shows the accuracy when using 75x45 frames. As can be noticed, with the same network and training parameters, the training is faster and more accurate when using a bigger input. The time spent per epoch increases as the input size increases, so a trade-off between input information and computational time has to be made.

Figure 11. Line graph comparing both considered input dimensions.

The time interval chosen to capture each frame was 80ms, meaning that if the game is running at around 60 frames per second (fps), a frame is captured approximately every 5 frames. Since the preprocessing of the images happened inside the agent's update cycle in order to feed the network, there is a processing cost associated with the resizing. Using an Intel Core i5-3360M CPU at 2.80GHz, this preprocessing usually took around 40-50ms per frame. In order to make sure the forward step would be concluded before the next capture, the capturing interval was chosen to be 80ms. This however meant that the outputted action would always be late by the amount of the preprocessing time. This cost was taken into account during the data collection.

The 80ms time interval was also chosen in order to have a total timespan of 240ms in each game state. Having a 240ms timespan allows the agent to have a better perception of its rotation and velocity. A longer time interval between frames would not be ideal because it could result in the loss of important actions and in-game situations (such as jumping points, places where the agent needs to learn where to jump).
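The following sketch illustrates this capture pipeline; it is not the author's implementation. The capture_frame callback stands in for the actual Geometry Friends API call (which is not shown in the text), and the resizing uses Pillow purely for illustration.

import time
import numpy as np
from PIL import Image

FRAME_W, FRAME_H = 150, 90      # resize target, preserving the 75x45 grid proportion
CAPTURE_INTERVAL = 0.080        # 80 ms between captures -> 240 ms spanned by one state

def preprocess(raw_frame):
    # Resize one raw RGB frame (uint8, any resolution) to 90 x 150 x 3.
    return np.asarray(Image.fromarray(raw_frame).resize((FRAME_W, FRAME_H)), dtype=np.uint8)

def capture_state(capture_frame):
    # Build one game state from four consecutive captures taken 80 ms apart.
    # The result is a 90 x 150 x 12 array: early integration, with the RGB
    # channels of the four frames stacked along the channel axis.
    frames = []
    for _ in range(4):
        frames.append(preprocess(capture_frame()))
        time.sleep(CAPTURE_INTERVAL)    # in the real agent this is driven by its update cycle
    return np.concatenate(frames, axis=2)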
Rewards
The rewards captured for the network training are 100 for the terminal state if the agent concluded the level successfully, -100 for the terminal state if the agent could not complete the level within the time limit, 100 for a non-terminal state when the agent captures a diamond, and -1 for all other non-terminal states. Since the agent receives fewer penalties when it completes the level quickly, the negative reward in non-terminal states helps the agent learn to complete the level as fast as possible.
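A minimal sketch of this reward scheme is shown below; the function and argument names are illustrative, not taken from the agent's code.

def step_reward(terminal, level_completed, diamond_caught):
    # +100 for finishing the level or catching a diamond, -100 for running out
    # of time, and -1 on every other step so slower runs accumulate more penalty.
    if terminal:
        return 100 if level_completed else -100
    return 100 if diamond_caught else -1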
Actions
The actions captured for the network training actually correspond to "key presses" rather than the actual character actions. To capture human game play, the recorded actions correspond to the keys a person presses. If no keys are pressed, the recorded action is "no action".

The captured actions are transformed into one-hot vectors. For example, the action Jump corresponds to the one-hot vector 0001. The actions, along with the rewards, are saved and mapped in files for the training phase.
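As a small illustration of this encoding, the sketch below maps the four action classes to one-hot vectors; the exact index order used by the agent is an assumption (only Jump = 0001 is stated in the text).

import numpy as np

ACTIONS = ["no_action", "roll_left", "roll_right", "jump"]   # index order assumed

def one_hot(action_name):
    vec = np.zeros(len(ACTIONS), dtype=np.float32)
    vec[ACTIONS.index(action_name)] = 1.0
    return vec

# one_hot("jump") -> [0., 0., 0., 1.], i.e. the vector written as 0001 above.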

Training
The main network architecture used was based on the networks used to play Atari games [6] and Super Smash Bros. [1]. The network used is depicted in Figure 12. The network architecture was chosen based on early experimentation with the Microsoft Cognitive Toolkit. This architecture was the one that showed the best results while performing the supervised learning training.

Since each input frame has size 150x90x3, considering the RGB channels, the input layer has size 150x90x12. Using 150x90x12 results in early integration. Early integration was chosen in contrast to late integration (150x90x3x4) in order to reduce the computational cost associated with having four separate convolutional parts. The output of the network translates into a probability distribution over the actions or into the Q-values of each action, depending on the training.

An Nvidia GeForce GTX 1080 GPU was used to train the networks.

Figure 12. Network architecture. The first convolutional layer has 128 filters, using 10 x 6 kernels with stride 4. The second convolutional layer has 256 filters, using 4 x 4 kernels with stride 2. The third convolutional layer also has 256 filters, using 3 x 3 kernels with stride 3. All three layers apply a rectifier nonlinearity. The following two layers are fully connected layers of 1024 neurons with 0.5 dropout. The output layer is a fully connected layer of 4 neurons that correspond to the different action classes.
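For readers who want to reproduce the Figure 12 architecture, the sketch below re-expresses it in PyTorch purely for illustration (the actual networks were built and trained with the Microsoft Cognitive Toolkit). The orientation of the 10 x 6 kernel, the absence of padding and the rectifiers on the fully connected layers are assumptions; the flattened size is inferred with a dummy forward pass rather than hard-coded.

import torch
import torch.nn as nn

class Figure12Net(nn.Module):
    # Illustrative PyTorch re-implementation of the Figure 12 architecture.
    # Input: four RGB frames stacked into 12 channels of a 90 x 150 image.
    def __init__(self, num_actions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(12, 128, kernel_size=(6, 10), stride=4), nn.ReLU(),   # 10 x 6 kernel, orientation assumed
            nn.Conv2d(128, 256, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, stride=3), nn.ReLU(),
        )
        with torch.no_grad():   # infer the flattened size instead of hard-coding it
            flat = self.conv(torch.zeros(1, 12, 90, 150)).flatten(1).shape[1]
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, num_actions),   # policy logits or Q-values, depending on the training
        )

    def forward(self, x):
        return self.head(self.conv(x))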
Acting
The acting is done independently from the regular update cycle of the agent. While the agent is executing, the game asks the agent for an action and tells it to update in real time (as fast as possible).

After training, a purely greedy strategy is used for action selection, where the output of the network is fully exploited. The choice made for action selection during training was to perform an ε-greedy strategy every 80ms, whenever a new output is given by the network. If the strategy selects a random action, this action is repeated until the next output is given. If the network output is chosen, two different things can happen depending on the solution. If the output of the network is a policy, a new action is repeatedly sampled according to each action's probability in that policy until the ε-greedy strategy is called again. If the output is Q-values, the best action is repeated for 80ms.

The intention of repeating an action for 80ms was to increase the impact of random actions on the exploration of the agent. During training there is a higher chance of choosing random actions in order to improve exploration. After each training step, this chance is lowered.

Since, in a normal Geometry Friends level, the frequency of the jump action is considerably lower than that of the other actions, and in order to avoid the agent jumping around the level, the random action distribution is not uniform. Table 1 shows the chosen distribution of the random actions, based on data collected during supervised learning; a sketch of this selection rule follows the table.

Table 1. Chosen action distribution
Actions      Action Probability
No Action    15%
Roll Left    40%
Roll Right   40%
Jump         5%
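The sketch below is a minimal version of this selection rule, assuming, as in the text, that ε is the probability of exploiting the network output (so ε = 0.9 leaves a 10% chance of a random action); the names and the Q-value exploitation branch shown are illustrative.

import numpy as np

ACTIONS = ["no_action", "roll_left", "roll_right", "jump"]
RANDOM_ACTION_PROBS = [0.15, 0.40, 0.40, 0.05]   # Table 1 distribution

def select_action(network_output, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    if rng.random() > epsilon:
        # Explore: sample from the non-uniform Table 1 distribution,
        # which deliberately makes random jumps rare.
        return rng.choice(ACTIONS, p=RANDOM_ACTION_PROBS)
    # Exploit: for a Q-value output take the arg-max (for a policy output the
    # action would instead be sampled from the output distribution).
    return ACTIONS[int(np.argmax(network_output))]

# The chosen action is then repeated until the next network output, 80 ms later.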
SOLUTIONS

Supervised Learning Agent
This agent was created using purely supervised learning. Based on the work in Super Smash Bros. [1] and the supervised learning agent from AlphaGo [10], this agent learns from human-generated data.

This solution tackles supervised learning as a classification problem. A normal classification problem consists of correctly classifying information. In this solution, the game state corresponds to the information that needs to be classified. The different agent actions are the classes. The main idea is to make the agent able to classify each game state with the action that corresponds to the best action to take in that state. For this, human game play data is captured and the agent is taught to classify according to the human decision.

Due to the lack of such resources and experts on Geometry Friends, the data used was collected from 100 levels solely by me. Each level was extracted from previous circle competitions, designed by creators of previous solutions, or designed by me with the intent of capturing simpler scenarios. Each level was played at least five times and the initial position of the circle character was changed regularly in order to create more diversified data.

The training was performed on the network shown in Figure 12, using a cross-entropy loss with softmax (Equation 4). This function calculates the softmax of the network output and then calculates the cross entropy between the softmaxed output and the target action.

softmax(output)_j = e^{output_j} / Σ_{k=1}^{4} e^{output_k}    (3)

loss(output, target) = − Σ_{j=1}^{4} target_j log(softmax(output)_j)    (4)
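Equations 3 and 4 can be written directly as a short sketch (a plain NumPy illustration, not the toolkit code used for training):

import numpy as np

def softmax(output):
    # Equation 3 (the max is subtracted only for numerical stability).
    e = np.exp(output - np.max(output))
    return e / e.sum()

def cross_entropy_loss(output, target_one_hot):
    # Equation 4: cross entropy between the softmaxed output and the
    # one-hot action chosen by the human demonstrator.
    return -np.sum(target_one_hot * np.log(softmax(output)))

# Example: the network favours the last class and the demonstrator jumped.
print(cross_entropy_loss(np.array([0.1, 0.2, 0.3, 2.0]),
                         np.array([0.0, 0.0, 0.0, 1.0])))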
Reinforcement Learning Agent: Q-learning (QL)
This solution uses reinforcement learning techniques based on the DQN algorithm. It consists of crafting a network that is an approximation of the Q-function, outputting an expected reward for each state-action pair. In this solution, given a game state, the network outputs a Q-value for each possible action. The action with the highest Q-value is the action that is expected to give the agent the highest reward.

In all reinforcement learning solutions, a total of three training iterations were made. In the first iteration, the data consisted of information gathered through self-play using the supervised learning agent. The supervised agent allowed a more effective exploration of the state-space, performing better than an initial random exploration. Following iterations used the previously trained network to explore each level. In the first two iterations, the agent trained on simpler levels. This was done in order to achieve a gradual learning progression where the agent first learns the simpler mechanics and then attempts the more difficult ones.

For the first two iterations, the action selection strategy used ε = 0.9. In the final iteration, ε = 0.95 was used.

To avoid catastrophic forgetting in our training across different iterations, all transitions captured in previous iterations are used in the following iterations.

The training was performed on the network shown in Figure 12. Each target Q-value was calculated based on the DQN algorithm, as shown in Equation 5. The function represented in Equation 6 was used as the training loss.

y_{s,a} = r                              if s' is a terminal state
y_{s,a} = r + γ max_{a'} Q̂(s', a')       otherwise                      (5)

loss(s, a) = (y_{s,a} − Q(s, a))^2    (6)

Based on the DQN algorithm, a target network Q̂ is also used. This network is initialized with the same parameters as the main network Q and is updated every 5 epochs.
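A minimal sketch of Equations 5 and 6 (illustrative NumPy, with the QL discount factor from Table 2 as the default):

import numpy as np

def dqn_target(reward, next_is_terminal, q_target_next, gamma=0.95):
    # Equation 5: r for terminal transitions, otherwise r plus the discounted
    # best Q-value of the next state under the target network Q̂.
    if next_is_terminal:
        return reward
    return reward + gamma * np.max(q_target_next)

def q_loss(y, q_sa):
    # Equation 6: squared error between the target and the current estimate.
    return (y - q_sa) ** 2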
Reinforcement Learning Agent: Policy Gradient (PG)
This solution was based on the A3C algorithm [7] and the RL agent from AlphaGo [10]. In this solution, two networks are trained: a policy network, which is initialized with the supervised network obtained in the first solution, and a value network. The value network is similar to the network structure used so far but has a single linear output. This value network informs the agent of the expected value to be obtained in each state. It is used in order to compare the received reward with its output and calculate the advantage. The advantage is what either reinforces or suppresses a certain action in a given action distribution.

Usually the advantage is the difference between the target Q-values and the expected value. In order not to have to calculate Q-values, the discounted reward was used as an estimate for these values during the training of both networks. This discounted reward was calculated using Equation 7.

R = r_1 + γ r_2 + γ^2 r_3 + ... + γ^{n−1} r_n    (7)

The function represented in Equation 8 was used as the training loss for the value network. The function represented in Equation 10 was used as the training loss for the policy network.

valueloss(s) = (R − V(s))^2    (8)

Advantage(s) = R − V̂(s)    (9)

policyloss(s) = − log(π(s)) × Advantage(s)    (10)

Both networks were trained at the same time. In order to decrease instability, a target network V̂ was used for the policy network update. This target network was updated every 5 epochs.
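The sketch below illustrates Equations 7-10 (π(s) is read here as the probability the policy assigns to the action that was actually taken; the PG discount factor from Table 2 is used as the default, and the names are illustrative):

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # Equation 7, computed for every step of an episode (returns[0] is R for the first step).
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def value_loss(R, v_s):
    return (R - v_s) ** 2            # Equation 8

def advantage(R, v_target_s):
    return R - v_target_s            # Equation 9, using the target value network V̂

def policy_loss(pi_s_a, adv):
    return -np.log(pi_s_a) * adv     # Equation 10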
Reinforcement Learning Agent: Demonstrations
This solution was based on the DQfD algorithm [3]. The difference between this solution and the pure Q-learning solution is that this solution contains human game play data. This data is used as demonstrations for the agent and is always present throughout the agent's training. This solution uses human data in order to improve and speed up the agent's learning.

The algorithm uses four losses during the agent's training. In this solution the training loss was adapted and the n-step Q-learning loss was removed, due to the incompatible way the data has been captured so far. Due to the non-existence of a memory replay, the whole data set is manually managed in order to keep close to an equal amount of demonstration and self-play data.

The training loss used is depicted in Equation 13. The function depicted in Equation 12 is a supervised loss for the human data only. Unlike the second solution, the Q-values are calculated based on double Q-learning [2], represented in Equation 11. This method was chosen in order to reduce the over-optimism of Q-learning, potentially providing a more reliable and stable learning.

y_{s,a} = r                                           if s' is a terminal state
y_{s,a} = r + γ Q̂(s', argmax_{a'} Q(s', a'))          otherwise                      (11)

supervisedloss(s, a) = max_{a'} [Q(s, a') + l(a, a')] − Q(s, a)    (12)

totalloss(s, a) = supervisedloss(s, a) + (y_{s,a} − Q(s, a))^2    (13)
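A sketch of Equations 11-13, with the margin value from Table 2 and the supervised term applied only to demonstration transitions (as stated above); the function names and the is_demo flag are illustrative assumptions:

import numpy as np

MARGIN = 10   # margin l(a, a') from Table 2

def double_q_target(reward, next_is_terminal, q_next, q_target_next, gamma=0.95):
    # Equation 11: the online network Q selects the next action and the target
    # network Q̂ evaluates it, reducing the over-optimism of plain Q-learning.
    if next_is_terminal:
        return reward
    return reward + gamma * q_target_next[int(np.argmax(q_next))]

def supervised_margin_loss(q_s, demo_action):
    # Equation 12: any action whose value is not at least a margin below the
    # demonstrated action's value is penalised.
    l = np.full(len(q_s), MARGIN, dtype=float)
    l[demo_action] = 0.0                     # l(a, a') = 0 for the demonstrated action
    return np.max(q_s + l) - q_s[demo_action]

def total_loss(y, q_sa, q_s, demo_action, is_demo):
    # Equation 13; the supervised term is only applied to demonstration data.
    sup = supervised_margin_loss(q_s, demo_action) if is_demo else 0.0
    return sup + (y - q_sa) ** 2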
Training Parameters
Table 2 showcases the training parameters chosen for each solution.

Table 2. Training Parameters
Parameters             Supervised   QL      PG      Demonstrations
initial learning rate  1e-4         1e-4    1e-4    1e-4
epochs                 100          500     500     500
update rule            Adam         Adam    Adam    Adam
minibatch size         64           64      64      64
discount factor        -            0.95    0.99    0.95
margin                 -            -       -       10
L2 regularization      -            -       -       0.002
RESULTS
In order to evaluate our solutions' performance, the agents were tested on a level set of 16 different stages. In order to compare the agents' performance across many different levels, the level set is divided into 4 different groups:

• Group I consists of simple levels where the agents trained. This group has the objective of checking whether the agents are able to perform simple tasks. Time limit 60s.
• Group II consists of complex levels where the agents trained. This group has the objective of evaluating the agents' capacity to complete complex levels where they trained. Time limit 240s.
• Group III consists of a set of simple levels where the agents did not train. This set has the objective of analyzing the agents' capacity to identify certain features from the levels they trained on. Time limit 120s.
• Group IV consists of a set of levels where the agents did not train. This set has the objective of analyzing the agents' performance in an unfamiliar environment. Time limit 300s.

The levels are depicted in Figure 13 with their respective time limits. The agents ran each level 10 times in order to mitigate the chance factor. The agents were evaluated by the number of runs they could complete out of the ten, the average number of collected diamonds, the average completion time and the score. The score of a run is obtained using Equation 14, the same formula used in official Geometry Friends AI competitions.

Score = V_Completed × (maxTime − agentTime) / maxTime + V_Collected × N_Collected    (14)
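As a small illustration of Equation 14, the sketch below computes the score of a single run; V_Completed and V_Collected are the competition weights for completing a level and for each collected diamond, and since their official values are not given in the text they are left as parameters (treating the completion term as applying only to completed runs is also an interpretation):

def run_score(completed, agent_time, max_time, n_collected, v_completed, v_collected):
    # Equation 14: time-scaled completion bonus plus a fixed value per collected diamond.
    completion_term = v_completed * (max_time - agent_time) / max_time if completed else 0.0
    return completion_term + v_collected * n_collected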
Figure 13. Levels used to measure the agents' performance. Each row corresponds to a different group.

Table 3 shows the scores obtained in each level for each solution. Each row has the best score for that level written in bold.

Table 3. Results of the crafted solutions
Levels   Supervised   QL    PG    Demonstrations
I-1      262          0     245   251
I-2      270          0     25    106
I-3      306          0     34    108
I-4      464          0     402   366
II-1     27           0     151   138
II-2     269          110   0     200
II-3     582          30    70    194
II-4     372          30    90    100
III-1    160          0     10    30
III-2    100          90    60    158
III-3    293          0     204   79
III-4    100          0     90    110
IV-1     30           30    150   140
IV-2     216          0     157   162
IV-3     325          100   0     60
IV-4     300          280   130   389
Total    4078         670   1818  2591

The supervised agent got the best results of all the agents. However, an overspecialization can be noticed on the levels where the agent trained, with the agent performing considerably worse in unfamiliar environments. On level II-1, the agent struggles with the stairs. Stair levels require good control of the circle character in order to climb them. Controlling the character across such a level requires precision that may be lost when reducing the input frames to such a small size. Level II-2, despite being part of the agent's training, is a level packed with jumps that require high precision. The difficulty in completing this level is due to the trial and error involved in performing such jumps within the time limit. In level III-2, the agent falls to catch the diamond below before catching the diamond on the platform; in this level a lack of path planning can be noticed. The group IV levels required a good amount of control. Although the agent was not able to consistently complete all levels, it showed promising control in unfamiliar environments.

The supervised learning agent showcased the following problems:

• No sense of reward - The agent has no sense of reward. It just mimics the demonstrator's actions given a certain state. This sometimes leads to it passing up easily catchable diamonds.
• Upper learning limit - Because the agent learned from human game play, the agent performs at best as well as the human that generated the data.

With the expectation of improving the supervised learning agent's performance, the agent was submitted to reinforcement learning, where it should gain a sense of reward and hone the actions learned from human game play. However, the results observed show that this improvement did not occur. The bad results observed in the crafted reinforcement learning solutions can be due to the following reasons:

• Lack of data - Other state-of-the-art reinforcement learning solutions required a lot of data, for games that are mostly simpler than Geometry Friends. Due to the game's complexity and limited resources, such an amount of data could not be produced and used here.
• Lack of training - It is possible that, due to the lack of more training iterations, the agents were not able to properly learn. In on-line learning, training is done as the agent plays the game, and the data is constantly renewed and produced. Batch reinforcement learning might lack in that regard: as the agent is training, no new data is produced, possibly resulting in a much slower training.
• Catastrophic forgetting - Although the Q-learning and policy gradient agents use the supervised learning policy to explore in the first iteration, these solutions do not have the supervised learning information in their data during training. Although measures to prevent catastrophic forgetting were applied, the data that could provide the algorithms hints towards the solution was not included. The growing batch method possibly had a negative impact, since failure data kept accumulating and the agent possibly had too few examples of success.
• Stability issues - When supervised learning is performed, the network is updated in the training phase based on all four actions; the update is done based on all four network output nodes. In reinforcement learning, the update is based only on one output node, corresponding to the action performed. This update can cause stability issues, where one update on an action influences the values of the other actions. This issue can be observed in the Q-learning solution, where, given a certain state, all actions have similar Q-values even though some of these state-action pairs were not present in the training data. These errors keep propagating across the different training iterations. On-line learning mitigates this effect because new data is generated while training, exploring those unexplored state-action pairs and grounding them to realistic values.
• Ungrounded variables - This problem is also related to the stability issues. The Q-values could have been updated based on state-action pairs that were possibly never explored. The network would have been updated based on the highest value of those ungrounded variables and would have propagated them across the Q-function. Reinforcement learning from demonstrations tries to solve this problem by introducing a margin function that forces the other actions to be at least a margin lower than the demonstrator's actions.
• Bad rewards - It is possible that the rewards chosen for each state are not good for this problem. Perhaps a better modeling of the reward function could result in better learning.
• Complex network - It is possible that the network is too complex to learn Q-values quickly. Although the network was tested on supervised learning, different parameters were not tested on the reinforcement learning solutions. Better learning could have resulted from a simpler and faster network.

The policy gradient agent outperformed the Q-learning agent. This might be due to the fact that the policy gradient agent started with an already trained network.

As expected, the demonstrations helped the agent learn faster and perform better when compared to the Q-learning solution. But even with the help of the demonstrations, the agent still did not perform as well as expected, possibly requiring more training. The unsatisfying results of the demonstration algorithm can also be due to the fact that the algorithm was constructed with on-line training in mind, and because the n-step Q-learning loss function was not used alongside the other three, as the article suggests.

Comparing to other solutions
Table 4 shows the results for the reinforcement learning solution by João Quitério [9] and the RRT agent [4]. In each row, values are highlighted if either agent outperformed the best of the presented solutions on that level.

Table 4. Results of the previous Geometry Friends solutions (RL agent [9] and RRT agent [4])
Levels   RL     RRT
I-1      274    281
I-2      25     0
I-3      0      0
I-4      334    100
II-1     100    0
II-2     100    100
II-3     130    220
II-4     100    0
III-1    317    260
III-2    219    361
III-3    130    0
III-4    378    411
IV-1     100    0
IV-2     90     0
IV-3     252    300
IV-4     180    100
Total    2729   670

Although the RL solution did not train on most of the simple levels, it shows good results on them. It severely struggles on harder levels, possibly due to the controlling problems reported by the author.

Comparing the levels where the previously presented agents trained would be unfair to this RL agent. Analyzing the levels that are unknown to all solutions (groups III and IV), the supervised learning agent has a total score of 1524, while this RL solution wins with a total score of 1666.

Most of the bad results reported for the RRT agent are due to the fact that the agent ends up crashing at the very beginning of the level. Since this solution depends on creating nodes and finding a path, the crash is probably related to the agent not being able to find a proper path to solve the level. This solution struggles in levels that require jumps.

From these results we can notice that dividing a problem into several subproblems and using path-finding techniques to choose where to go on each level can reduce overspecialization and help complete levels that require an order by which the diamonds need to be collected, like levels III-2 and III-4, where most deep learning solutions struggle.

CONCLUSION
Geometry Friends is a difficult game and an interesting study case for AI. Although there are many developed solutions for this game, none was able to complete several Geometry Friends levels consistently.
The tests performed show that a simple supervised learning strategy is able to complete many Geometry Friends levels with a little amount of data and training. The tests also show that deep reinforcement learning strategies are difficult to apply, possibly due to the game's complexity. Reinforcement learning still has the potential to surpass the supervised learning solution; therefore, all proposed solutions still have a lot of room for improvement.

Deep learning solutions are hard to develop, since they require a good understanding of the tools used, a lot of experimentation with many different training parameters, the management of big amounts of data and a lot of waiting time until results are obtained. Networks also work as black boxes: it is hard to understand what is happening inside the network, making it hard to debug and to understand what could be going wrong.

Compared to previously crafted Geometry Friends solutions, the crafted solutions suffer from overspecialization on the trained levels. Although they do not complete unknown levels consistently, these solutions are still able to recognize features from the levels they trained on. This small generalization is not enough, since they still lack precision in control and a path planning strategy.

The main strength of a deep learning solution is the character control across different Geometry Friends levels. Since it is not rule-based and has a better perception of the level state, the agent adapts its velocity and jumps accordingly.

In conclusion, the supervised learning agent developed is a good contribution to the Geometry Friends problem. The presented solution is very simple and practical, since it requires little data and training, it works on a rather simple CPU and it does not apply any higher-level planning.

ACKNOWLEDGMENTS
I would like to thank my supervisors, professor Francisco Melo and professor João Dias, for their guidance, support and feedback throughout the last year.

Special thanks to professor Rui Prada and professor Manuel Lopes for their guidance. Although they were not my formal supervisors, their input and guidance while working with Geometry Friends were very valuable.

I also would like to thank my colleagues and the other professors I was lucky to spend my time with. Their availability to discuss ideas and to help find other possible solutions was important to me.

Finally, I would like to show my gratitude towards my family and friends for their caring, understanding and constant support.

REFERENCES
1. Chen, Z., and Yi, D. The Game Imitation: A Portable Deep Learning Model for Modern Gaming AI. Course Project Reports: Winter 2016 (CS231n) (2016).
2. Hasselt, H. V., Guez, A., and Silver, D. Deep Reinforcement Learning with Double Q-learning. AAAI Conference on Artificial Intelligence (AAAI) (2016).
3. Hester, T., Pietquin, O., Horgan, D., Lanctot, M., Schaul, T., Quan, J., Osband, I., and Dulac-Arnold, G. Learning from Demonstrations for Real World Reinforcement Learning. arXiv preprint arXiv:1704.03732v3 (2017).
4. Leal, F., and Soares, R. RRT Agents Technical Report. 2015.
5. Microsoft Cognitive Toolkit. https://www.microsoft.com/en-us/cognitive-toolkit/.
6. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature (2015), 529–533.
7. Mnih, V., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., and Silver, D. Asynchronous Methods for Deep Reinforcement Learning. International Conference on Machine Learning (2016), 1928–1937.
8. Piot, B., Geist, M., and Pietquin, O. Boosted Bellman Residual Minimization Handling Expert Demonstrations. European Conference on Machine Learning (ECML) (2014), 549–564.
9. Quitério, J. A Reinforcement Learning Approach for the Circle Agent of Geometry Friends. June 2015.
10. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature (2016), 484–489.
