Go Blend Behavior and Affect

Abstract—This paper proposes a paradigm shift for affective computing by viewing the affect modeling task as a reinforcement learning process. According to our proposed framework, the context (environment) and the actions of an agent define the common representation that interweaves behavior and affect. To realise this framework we build on recent advances in reinforcement learning and use a modified version of the Go-Explore algorithm, which has showcased supreme performance in hard exploration tasks. In this initial study, we test our framework in an arcade game by training Go-Explore agents to both play optimally and attempt to mimic human demonstrations of arousal. We vary the degree of importance between optimal play and arousal imitation and create agents that can effectively display a palette of affect and behavioral patterns. Our Go-Explore implementation not only introduces a new paradigm for affect modeling; it empowers believable AI-based game testing by providing agents that can blend and express a multitude of behavioral and affective patterns.

Index Terms—Reinforcement Learning, Go-Explore, Arousal, Affective Computing, Artificial Agents, Gameplaying

This project has received funding from the European Union's Horizon 2020 programme under grant agreement No 951911.

I. INTRODUCTION

Affective computing is traditionally viewed from an expert-domain and supervised learning lens through which manifestations of affect are linked to ground truth labels of affect that are provided by humans. Behavior and affect are either blended in the form of hand-crafted rules [1], [2] or machine learned via supervised learning methods [3]. While affect models designed or built this way are linked to the context of the interaction, they are often completely independent of the behavior of the involved actors.

A recent (non-deep) reinforcement learning (RL) algorithm, Go-Explore [4], showcased superb performance at hard exploration problems with many states—such as complex planning-based games—that most other deep learning methods struggled with. In its application to the game Montezuma's Revenge (Parker Brothers, 1984), Go-Explore reached super-human gameplaying performance. In part, this is achieved by storing all visited game states and exploring from such interim states rather than playing the game from the start [5]. Inspired by these recent breakthroughs in RL, we leverage the capacity of Go-Explore to introduce a paradigm shift for affect modeling. We argue that viewing affect modeling as an RL process yields agents (or computational actors) that manage to reliably interweave behavior and affect without necessarily relying on affect corpora of massive sizes.

The proposed concept revolutionizes affective computing, which traditionally attempts to model human affect in the context of an interaction but largely ignores the affective response to the actions of the involved (inter-)actors. Both behavior and affect are blended in an internalised model that associates an agent's context (environment) and its actions to both its behavioral performance and its affective state. At the same time, we introduce a novel paradigm for RL where the rewards are not only tied to a user's behavior but combined with rewards from annotations provided by the users themselves (i.e. human affect demonstrations). According to our approach, both behavior and affect can form reward functions that can be experienced by RL agents that learn to behave and express affect in various ways. The proposed Go-Explore implementation is tested in a simple arcade game featuring a rich corpus of self-reported traces of arousal.

Our key findings suggest that agents can be trained effectively to behave in particular ways (e.g. play optimally with super-human performance) but also to "feel" as humans would in a particular game state. Beyond the proposed paradigm shift in affective computing, our Go-Explore agents offer insights on the relationship between affect and behavior through their RL-trained models. Importantly, RL agents that blend behavior and affect can be used directly for believable testing, as such agents can simultaneously simulate and express both behavioral and affective patterns of humans.

II. BACKGROUND

This section provides a brief overview of the related domains of reinforcement learning, the Go-Explore algorithm, traditional affect modelling via imitation learning, and affect modelling using reinforcement learning.

A. Reinforcement Learning and Go-Explore

Reinforcement learning approaches machine learning tasks from the perspective of behavioral psychology, mimicking the way animals and humans learn through receiving positive or negative rewards for their actions [6]. Exploring state spaces with sparse and/or deceptive rewards has been a core challenge for traditional RL algorithms, as they suffer from issues of detachment and derailment. Detachment occurs when an algorithm forgets how to return to previously visited promising
areas of the search space due to exploration in other areas. Derailment is a consequence of RL algorithms which do not separate returning to states from exploring the search space. This may result in potentially promising states being missed, as they require a long sequence of precise actions that is unlikely to occur under exploratory conditions.

Go-Explore is a recent algorithm in the RL family [7] which is explicitly designed to overcome the two aforementioned challenges. The algorithm was introduced with the aim of improving RL performance in hard-exploration problems, which tend to contain sparse or deceptive rewards. Go-Explore has demonstrated previously unmatched performance in Atari games [5], highlighting its ability to thoroughly explore complex and challenging environments. In games with sparse rewards (such as Montezuma's Revenge), a large number of actions must be taken before a reward can be obtained, whereas deceptive rewards may mislead the agent and result in premature convergence and therefore poor performance [8]. Go-Explore has been used for text-based games, outperforming traditional agents in Zork1 [9], and is able to generalize to unseen text-based games more effectively [10]. The algorithm's capabilities have also been demonstrated in complex maze navigation tasks which could not be completed by traditional RL agents [11]. Beyond playing planning-based games with superhuman performance, Go-Explore has been used for autonomous vehicle control for adaptive stress testing [12], and as a mixed-initiative tool for quality-assurance testing using automated exploration [13]. While Go-Explore has proved to be a highly effective algorithm for behaviour policy search, it has never been tested on affect modeling tasks. This proof-of-concept paper introduces the first application of the algorithm for modeling affect as an RL process and blending it with behavior within a game agent.

B. Reinforcement Learning and Affective Computing

Traditionally, affect modelling [3] involves constructing a computational model of affect that takes as input the context of the interaction, such as pixels [14], [15], and multimodal information about a user—including physiological signals [16], facial expressions [17], [18] or speech [19]—and outputs a predicted corresponding emotional state (i.e. the ground truth of emotion). Given that affective computing relies on a provided ground truth of emotion that is human-annotated, affect detection is naturally viewed as a supervised learning task [3]. Traditionally, a dataset of user state-affect pairs is used to train a model to predict affect [20]. Trained affect models are then used in conjunction with action selection methods for the synthesis, adaptation and affect-based expression of agents, including virtual humans [21] and social believable agents [22].

Beyond the obvious uses of RL for learning a behavior policy, RL has been used as a paradigm for creative AI and, in particular, for the procedural generation of content (PCG) [23]. While the experience-driven PCG framework [24] considered the use of affect models beyond the behavior action space, its initial version never considered RL as a training paradigm for such generators. As a response, a recent study blended the frameworks of experience-driven PCG and PCG via reinforcement learning, namely ED(PCG)RL; EDRL in short [25] focuses on the use of RL for the algorithmic creation of content according to a surrogate model of player experience or affect.

Whilst there exist a variety of studies on the topic of agent emotion and reinforcement learning, literature on using human-annotated emotion as a training signal for learning is limited [26]. It has been shown that coupling an agent's simulated affect with its action-selection mechanism allows it to find its goal faster and avoid premature convergence to local optima [27]. Similarly, [28] showed that using affect as a form of social referencing is a simple method for teaching a robot tasks, such as obstacle avoidance and object reaching. Work on intrinsic motivation through the RL paradigm [29], [30] is also highly relevant to our aims. By definition, however, intrinsic motivation studies ignore human demonstrations, both behavioral and, importantly, affective [31]. A number of very recent studies (e.g. [32]) view the intrinsic motivation paradigm from an inverse RL lens through which reward functions are inferred from behavioral demonstrations.

The work in this paper expands upon the current state of the art by viewing affect modeling as an RL paradigm and explicitly blending agent behavior and affect using a cutting-edge RL algorithm for hard exploration problems. The result is a set of agents which are tested in games in this initial study. The game agents are trained to behave (i.e. play) optimally, even better than humans, to "feel" like a human would (via arousal imitation), or to blend the two approaches with varying degrees of importance.

III. BLENDING BEHAVIOUR AND AFFECT

This paper proposes combining rewards for good behavioral performance with rewards for affect matching in a reinforcement learning agent. We leverage the Go-Explore RL algorithm and describe our implementation in Section III-A, and how it is enriched with affect information in Sections III-B and III-C.

A. Go-Explore Implementation

The Go-Explore algorithm builds on two phases to create a robust search policy that performs well under a specified reward scheme received from the environment. The first phase is the exploration phase, where a deterministic model of the environment is used to explore the search space thoroughly. During exploration, an archive of the states encountered so far is used to ensure states are not forgotten, thus preventing the issue of detachment. Each state in the archive also contains the string of actions needed to return to it, addressing the issue of derailment and ensuring that all states can be visited. States are chosen using a selection strategy (e.g. randomly or through the UCB formula [33]), after which the algorithm returns to the state as described and begins exploring from there. At its simplest, exploration occurs by taking random actions and updating the cell archive with new states or updating existing
ones with better reward values. The move selection strategy can be improved according to the nature of the environment being searched and through the use of expert knowledge.

The result of the exploration phase is a number of high-performing trajectories using the deterministic model. If required, the robustification phase uses the "backward algorithm" [34] to train an agent to perform at the same level (or better) as the trajectories found in exploration, but in a stochastic setting. The backward algorithm is an RL technique used to learn from a given trajectory by decomposing the problem into smaller exploration tasks. It starts by placing the agent near the end of the trajectory and uses an off-the-shelf RL algorithm to train the agent to imitate its last segment. This is repeated several times, moving the starting point further back until the beginning of the trajectory is reached and the agent has been trained on the entire trajectory. To stabilize learning, Go-Explore extends this method to use multiple trajectories which are uniformly sampled at the beginning of each learning episode.
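To make the backward algorithm concrete, the sketch below illustrates the loop described above under a number of assumptions: a deterministic emulator that can replay a prefix of a demonstration, and placeholder callables (replay_prefix, run_episode, update_policy) standing in for an off-the-shelf RL learner. These names and the step size are illustrative, not the interface used in [34].

```python
# Hedged sketch of the "backward algorithm": train from the end of a demonstration,
# moving the starting point back once the policy matches the demonstration's score.
from typing import Callable, Sequence

def backward_robustify(
    demo_actions: Sequence[int],                     # demonstration trajectory, start -> end
    demo_score: float,                               # score achieved by the demonstration
    replay_prefix: Callable[[Sequence[int]], None],  # reset env and replay these actions
    run_episode: Callable[[], float],                # let the policy play on, return its score
    update_policy: Callable[[float], None],          # one RL update from the episode outcome
    step_back: int = 5,
    episodes_per_start: int = 100,
) -> None:
    """Train a stochastic policy on the demo, starting near its end and moving backwards."""
    start = max(len(demo_actions) - step_back, 0)
    while True:
        for _ in range(episodes_per_start):
            replay_prefix(demo_actions[:start])      # deterministically return to the start point
            score = run_episode()                    # the policy completes the remaining segment
            update_policy(score)
            if score >= demo_score:                  # segment mastered: move the start point back
                break
        if start == 0:                               # the whole trajectory has been covered
            break
        start = max(start - step_back, 0)
```

In the full Go-Explore robustification phase, several demonstration trajectories would be sampled uniformly at the start of each episode rather than a single one, as noted above.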
Our implementation of Go-Explore follows the original approach by Ecoffet et al. [5] (see Fig. 1). An archive of cells stores the game states that have been visited, with each cell representing a unique game state and containing the instructions needed to reach that point in the game. Each cell has an associated reward value, which is used to determine if the cell should be updated in case a similar state with a better score is found. Cells are chosen to explore from randomly, and the actions taken to build trajectories during exploration are also random. Along with the action trajectory to return to its state, each cell also contains trajectories for the state with accompanying cumulative behavioral and affect rewards per trajectory. This implementation of Go-Explore differs from the original version through the inclusion of affect (i.e. arousal in this study) in the reward function. Moreover, in this paper the robustification phase of Go-Explore is not carried out but will be explored in future work.

Fig. 1. A high-level overview of Go-Explore that blends agent behavior and affect.
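The following is a minimal sketch of the exploration phase just described: a cell archive keyed by a discretised game state, random cell selection, random actions, and cells updated when a trajectory reaching the same cell obtains a better blended reward (the Rλ of Eq. (1), defined in Section III-C). The environment interface (reset, step, the cell key it returns) is a hypothetical stand-in for the deterministic game simulation, not the authors' code.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, Hashable, List

@dataclass
class Cell:
    trajectory: List[int] = field(default_factory=list)  # actions that return to this state
    r_behavior: float = 0.0   # cumulative (normalized) behavior reward R_b
    r_affect: float = 0.0     # cumulative (normalized) arousal reward R_a
    r_blend: float = 0.0      # R_lambda = lam * R_a + (1 - lam) * R_b

def explore(env, lam: float, iterations: int = 4000, max_actions: int = 20,
            n_actions: int = 6) -> Dict[Hashable, Cell]:
    """Exploration phase: return an archive mapping cell keys to their best trajectories."""
    archive: Dict[Hashable, Cell] = {env.reset(): Cell()}
    for _ in range(iterations):
        key = random.choice(list(archive.keys()))       # "go": pick an archived cell at random
        env.reset()
        for action in archive[key].trajectory:          # deterministically return to it
            env.step(action)
        trajectory = list(archive[key].trajectory)
        for _ in range(max_actions):                    # "explore": take random actions
            action = random.randrange(n_actions)
            r_b, r_a, new_key = env.step(action)        # cumulative rewards + new cell key
            trajectory.append(action)
            r_blend = lam * r_a + (1 - lam) * r_b       # blended reward, Eq. (1)
            best = archive.get(new_key)
            if best is None or r_blend > best.r_blend:  # keep the best trajectory per cell
                archive[new_key] = Cell(list(trajectory), r_b, r_a, r_blend)
    return archive
```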
B. Arousal Model

A natural question arising when one is asked to blend behavior and affect within a learning process is how the two pieces of information will be considered and fused. An obvious requirement is that the human annotations of affect are time-continuous, thus providing moment-to-moment information about the change of affective states and aligning them with game states stored in playtraces.

One approach for calculating an affect reward would be to build a priori models of affect using supervised learning and use their predicted outcomes indirectly as affect-based reward functions. Instead, one could use the affect labels directly and build reward functions based on this information. Rather than relying on a trained surrogate model of arousal in a given state, our algorithm queries a dataset of human arousal demonstrations to find the arousal value of the human player closest to the current game state. We use the playtraces and their associated arousal traces directly to assess the player's arousal value in that state which, in turn, provides the intended arousal goal at this point in time.
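A sketch of this lookup is given below: the agent's arousal in a given state is taken from the closest game state found in the corpus of human playtraces and their time-continuous arousal annotations. The dataset layout (a list of state-arousal pairs) and the distance metric are illustrative assumptions, not the exact AGAIN data format.

```python
from typing import List, Sequence, Tuple

def closest_human_arousal(
    agent_state: Sequence[float],
    human_observations: List[Tuple[Sequence[float], float]],  # (game state, annotated arousal)
) -> float:
    """Return the arousal annotated for the human game state nearest to the agent's state."""
    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean distance (assumed)
    _, arousal = min(human_observations, key=lambda obs: distance(obs[0], agent_state))
    return arousal
```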
C. Reward Function

The reward function used for this version of Go-Explore consists of two weighted functions for optimizing behavior and imitating human affect, respectively. Both components are normalized within the range [0, 1] to avoid uneven weighting between the two objectives. In particular, the reward function used, Rλ, is as follows:

Rλ = λ · Ra + (1 − λ) · Rb    (1)

where Ra and Rb are the rewards associated with affect and behavior, respectively, and λ is a weighting parameter that blends the two rewards. Formally, the reward associated to affect (i.e. arousal in this paper) is computed as follows:

Ra = (1/n) Σ_{i=0}^{n} (1 − |h(i) − a(i)|)    (2)

where i is a playtrace and affect annotation observation within a time window; n is the number of observations made so far in this trajectory; h(i) is the agent's estimated arousal value in its current game state; and a(i) is the arousal goal at this point in the game. In this paper, we derive h(i) and a(i) directly from human playtraces and their accompanying affect annotations (see Section III-B). Specifically, we calculate a(i) by first creating a mean arousal trace, averaging all players' arousal values at the same timestamp: this creates a moment-to-moment arousal trace that captures the consensus of players (regardless of actual game context). a(i) is then calculated by finding the arousal value of this mean arousal trace for that time window i. On the other hand, h(i) is based on the agent's current game state, finding the annotated arousal value of a human playtrace at any timestamp which has an accompanying game state closest to the agent's game state.

Maximizing Ra amounts to minimizing the absolute difference between the arousal value of a human player in a game state similar to the agent's, and the mean annotated arousal value at time window i. Since this difference is averaged across the number of observations made so far, it encourages trajectories with high imitation accuracy across the whole arousal trace generated.

The reward for the agent's behavior (Rb) depends on the game; in this paper we assume that the total score accumulated throughout the game is a sufficient reward for optimal behavior. This assumes that the environment follows arcade game tropes where the game is played for a high score, as is the case in our case study described in Section IV. In more complex games, or in games without an explicit score, the reward signal must be designed on an ad-hoc basis, such as the reward function used in the original implementation of Go-Explore [5].

According to Eq. (1), if λ = 0 the reward function trains the agent to only maximise its score (i.e. optimize its behavior) and ignore its associated arousal trace. On the other hand, if λ = 1 the agent is trained to imitate human arousal and ignore its behavior.
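The two equations above reduce to a few lines of arithmetic. The sketch below assumes per-window arousal values already normalized to [0, 1]; h_values holds the agent's estimated arousal per time window (looked up from the closest human state, as in Section III-B) and a_values holds the goal taken from the mean human arousal trace. The variable names are illustrative.

```python
from typing import Sequence

def arousal_reward(h_values: Sequence[float], a_values: Sequence[float]) -> float:
    """Eq. (2): mean arousal-matching reward over the observations made so far."""
    n = len(h_values)
    if n == 0:
        return 0.0
    return sum(1.0 - abs(h - a) for h, a in zip(h_values, a_values)) / n

def blended_reward(r_affect: float, r_behavior: float, lam: float) -> float:
    """Eq. (1): R_lambda = lambda * R_a + (1 - lambda) * R_b, with both terms in [0, 1]."""
    return lam * r_affect + (1.0 - lam) * r_behavior

# Example: an agent whose arousal closely tracks the goal but whose score is low.
r_a = arousal_reward([0.6, 0.7, 0.8], [0.5, 0.7, 0.9])  # = (0.9 + 1.0 + 0.9) / 3 ~ 0.93
r_b = 0.25
print(blended_reward(r_a, r_b, lam=0.5))                 # ~ 0.59
```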
IV. CASE STUDY: ENDLESS RUNNER

The proposed vision of blending arousal and performance rewards is tested in the "Endless Runner" game (hereafter Endless). Endless is a platformer game built using the Unity Engine and featured in the AGAIN dataset [35]. The game was chosen for its simple mechanics and objective, and for its accompanying dataset of 112 annotated human play sessions that can be easily used for the arousal model.

A. Game Description

In Endless, the player controls an avatar that constantly moves towards the right and must avoid or destroy obstacles that spawn in their path. The platform consists of two lanes (top/bottom) and the player's only controls are switching lanes by moving up or down (via keyboard input) and/or using the melee attack described below. Game objects are placed on one of the two lanes (upper/lower) and are spawned at random intervals. Game objects include items that the player may collide with to improve their score (coins) or alter their movement speed (potions). Other game objects are obstacles (which include immobile enemies); the player must use their melee attack when in close proximity to an obstacle in order to clear it. Colliding with an obstacle results in a 10-point score penalty, destroys any nearby game objects on the screen, and resets the player's speed to the default value. Every 3 seconds the player is passively awarded a point to their game score, on top of any bonus points they may receive for collecting coins. Every 10 seconds the speed of the player increases by a fixed amount, increasing the difficulty of the game. In theory, the game can be played for as long as the player wants. During data collection for the AGAIN dataset, an Endless session ended after exactly two minutes and the player had infinite lives. We follow the same duration in all experiments in this paper in order to leverage players' affect annotations and compare the agents' performance with human play.

Fig. 2. Endless Runner Game Layout.

B. Go-Explore for Endless Runner

The game was converted into a deterministic environment to be compatible with the exploration phase of Go-Explore. The sequence of objects to be spawned and their spawn times was fixed to ensure the same sequence of game states is observed when replaying trajectories. Moreover, the game could start from any saved snapshot (i.e. any visited game state). This minimizes the time spent returning to a new cell's state and allows the algorithm to focus on exploration, an approach central to the Go-Explore paradigm [5]. To decide which cell a game state should be assigned to, the game state is mapped to an 8-parameter vector describing the player's current lane (two binary values, one per lane) and which game objects are on each lane at specific distance bands (near, mid-distance, and far). The possible values for these bands are empty, item, or obstacle; if items and obstacles exist in the same band, it is treated as an obstacle band.
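As an illustration of this 8-parameter cell representation, the sketch below builds a key from two binary lane indicators plus three band values per lane, with obstacles dominating items in a shared band. The helper names and the numeric band thresholds are assumptions for the example, not values reported in the paper.

```python
from enum import IntEnum
from typing import List, Tuple

class Band(IntEnum):
    EMPTY = 0
    ITEM = 1
    OBSTACLE = 2

def band_contents(objects: List[Tuple[str, float]], lo: float, hi: float) -> Band:
    """Summarise one distance band of a lane; obstacles dominate items in the same band."""
    kinds = {kind for kind, dist in objects if lo <= dist < hi}
    if "obstacle" in kinds:
        return Band.OBSTACLE
    if "item" in kinds:
        return Band.ITEM
    return Band.EMPTY

def cell_key(player_lane: int,
             lane_objects: List[List[Tuple[str, float]]],    # per lane: (kind, distance ahead)
             bands=((0.0, 5.0), (5.0, 15.0), (15.0, 30.0))   # near / mid / far (hypothetical)
             ) -> Tuple[int, ...]:
    """Map a game state to the 8-parameter vector used to index the cell archive."""
    key = [int(player_lane == 0), int(player_lane == 1)]     # two binary lane indicators
    for objects in lane_objects:                              # two lanes
        for lo, hi in bands:                                  # three bands each
            key.append(int(band_contents(objects, lo, hi)))
    return tuple(key)
```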
The reward for optimal behavior (Rb) in Endless is the player's total score after an action is taken. This value is normalized between 0 and 1 with respect to the optimal score achievable in the play session. The optimal in-game score is calculated by summing two components. The first is the total amount of points awarded to the player passively over time for not dying during the game. The second is the maximum amount of bonus points achievable by picking up every coin in the deterministic environment.
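A sketch of this normalization is shown below. Passive points follow the game description (one point every 3 seconds); the per-coin bonus value is an assumption made only for the example.

```python
def optimal_score(session_seconds: float, total_coins: int,
                  coin_bonus: int = 1, passive_interval: float = 3.0) -> float:
    """Best score attainable in the deterministic session: survival points plus all coins."""
    passive_points = session_seconds // passive_interval  # 1 point every 3 seconds survived
    return passive_points + total_coins * coin_bonus      # coin_bonus is a hypothetical value

def behavior_reward(current_score: float, session_seconds: float, total_coins: int) -> float:
    """R_b in [0, 1]: the agent's score relative to the optimal score for the session."""
    best = optimal_score(session_seconds, total_coins)
    return max(0.0, min(1.0, current_score / best)) if best > 0 else 0.0

# Example: a 120-second session with 25 coins has an optimal score of 40 + 25 = 65.
print(behavior_reward(current_score=52, session_seconds=120, total_coins=25))  # = 0.8
```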
C. Experimental Protocol

Reported results per method are averaged across five independent runs of the Go-Explore algorithm. Each run consists of the exploration phase of Go-Explore (there is no robustification phase in this first experiment), and the agent returns and explores 4,000 times before selecting the best trajectory and saving it. The agent explores a maximum of 20 actions before choosing a new state to explore from. The actions taken during exploration are chosen at random among the 6 possible options (move up, move down, move up and attack, move down and attack, attack only, or no action). The new state to explore from is chosen at random among those already discovered: neither the reward of the state in the archive nor the number of times it has been visited is considered. The best trajectories are saved and can be used for the robustification phase of Go-Explore in future work.

The λ parameter of Eq. (1) was varied to observe the relationship between learning to play the game optimally and learning to imitate human-annotated arousal. Table I shows the five values used for the λ parameter, ranging from 0 to 1 in increments of 0.25. Recall that at λ = 0 and λ = 1 the agent tries to learn to solely behave optimally or to solely "feel" like a human, respectively. As a baseline, an experiment with an agent that performed random actions was carried out, with results averaged from 5 independent runs. To estimate this random agent's arousal levels, a trace was generated based on the game states visited, using the same approach as in the Go-Explore experiments.

The results were compared to the average performance seen by humans in the dataset for both the behavior and arousal reward functions. All results given are the average observed across the 5 runs of Go-Explore, paired with the 95% confidence interval.
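The protocol above can be summarised as the sketch below, which reuses the exploration-phase sketch from Section III-A. The environment constructor and the result aggregation are illustrative assumptions; the paper reports means with 95% confidence intervals, whereas only means are computed here for brevity.

```python
import statistics

def run_protocol(make_env, explore, runs: int = 5, iterations: int = 4000,
                 max_actions: int = 20):
    """Sweep lambda from 0 to 1 in steps of 0.25, with several exploration runs per value."""
    results = {}
    for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
        best_rewards = []
        for _ in range(runs):
            archive = explore(make_env(), lam, iterations, max_actions)
            best = max(archive.values(), key=lambda cell: cell.r_blend)
            best_rewards.append((best.r_behavior, best.r_affect, best.r_blend))
        # Average R_b, R_a and R_lambda of the best trajectory across the independent runs.
        results[lam] = tuple(statistics.mean(vals) for vals in zip(*best_rewards))
    return results
```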
D. Results

TABLE I
Results for Endless averaged from 5 runs and including the 95% confidence intervals.

Experiment Setup    Rb                Ra                Rλ
R0.0                0.79 (±0.0474)    0.72 (±0.0126)    0.79 (±0.0474)
R0.25               0.73 (±0.0818)    0.73 (±0.0181)    0.73 (±0.0569)
R0.5                0.74 (±0.0741)    0.74 (±0.0145)    0.74 (±0.0311)
R0.75               0.69 (±0.0658)    0.76 (±0.0147)    0.74 (±0.0082)
R1.0                0.25 (±0.1335)    0.79 (±0.0056)    0.79 (±0.0056)
Random              0.03 (±0.1012)    0.75 (±0.0074)    N/A
Human               0.70 (±0.0467)    0.77 (±0.0131)    N/A

Table I shows the final values observed for the cumulative behavior (Rb) and arousal (Ra) components, as well as the overall reward function (Rλ), for each experiment. Note that the baseline agent and human entries are not included in the Rλ column as they were not trained using Go-Explore. Figure 3c illustrates how the agents' overall cumulative reward fluctuates over time for each Go-Explore configuration. Note that, due to the different λ values, the Rλ values across experiments are not comparable, but the differences in how they fluctuate over time provide insight into the behavior of the algorithm. It is clear that agents with higher priority assigned to arousal imitation tend to converge to their maximum value more quickly due to the nature of the arousal reward function. Since at R0.0 the total reward amounts to a normalized measure of the agent's in-game score, it is not surprising that high scores are only attainable at late points in the game. Instead, states that match the mean arousal trace seem to be easily discovered even early in the game.

Looking at the results for the behavioral component (i.e. the total game score normalized to the absolute best possible score), the random agent shows the worst performance, as one would expect when playing most games. While the exploration phase of Go-Explore relies on a random sequence of actions, the discovery of interim states (cells) to explore from and the optimization of these states based on Rλ clearly lead to a more efficient playstyle than random. For R1.0, the agent still manages to produce a better score than the random agent but remains significantly lower than the average human player. Random and R1.0 also display a wider confidence interval compared to the rest of the experiments, pointing to inconsistent behavior. When the behavior component is introduced with a small weight (e.g. R0.75), the score immediately matches that of the average human demonstration. As λ is lowered to zero, the agent's score improves and surpasses human levels of performance. Figure 3a illustrates how the agents' cumulative behavior reward changes over time for each configuration. As noted above, the cumulative behavior reward is very time-dependent by design (players reach higher scores the longer they play), but clearly the random agent (and R1.0 to a degree) tends to lose score by hitting obstacles, which seems to perfectly offset passive score gains.

The results for the arousal component tell a similar story to the results for behavior, with the exception of the random agent. Unsurprisingly, the arousal score increases as λ increases from 0 to 1. What is surprising, however, is the arousal score attained by the random agent, which seems to be almost at the same level as the human trace and is only significantly surpassed by R1.0. The potential reasons for this are discussed in Section V. Figure 3b illustrates how the agents' cumulative arousal reward changes over time for each configuration. It is evident that, unlike Rb which is tied to the game score, it is easy to attain high values in Ra early on, and it is also easy to maintain the same levels throughout the game even when performing random actions.
V. DISCUSSION
approach for deriving an arousal value for a given state with this limitation in mind would help generate more diverse traces and allow the differences in the reward functions to become more pronounced. A more complex game where the agent has more degrees of freedom and more arousing stimuli for the human playtesters will also likely illuminate the strengths and weaknesses of this approach.

This proof of concept opens up several avenues for future work to further explore the relationship between behavior and affect in the context of reinforcement learning. Obvious next steps have been highlighted above in terms of refining the arousal reward function and testing the approach in more complex, more stimulating games. Another direction is testing machine-learned predictors of affective states rather than the direct mapping to the closest human trace performed currently. While surrogate models are often inexact, this may counteract the sparse game states encountered by human players when matching an unseen state. More importantly, incorporating the robustification phase in the Go-Explore algorithm is expected to lead to new insights on the impact of affect-based rewards, especially since the environment will no longer be deterministic and thus many more game states are likely to be visited. Finally, imitating human behavior (as a form of reward function) can reveal interesting new relationships between human-like behavior, affect, and optimal play; such derived policies would likely allow agents to play (near) optimally, whilst attempting to imitate both human behavior and human affect.

VI. CONCLUSION

This paper presents a proof-of-concept implementation of a new reinforcement learning paradigm for affective computing where behavioral and affective goals are interwoven. We leverage the Go-Explore algorithm due to its cutting-edge ability to solve hard exploration problems, and we pair it with a set of reward functions that blend optimal behavior with arousal imitation to different degrees. Using the Endless Runner game as a platform to test the implementation, we were able to make use of an extensive dataset of human play sessions and accompanying arousal demonstrations that guided the agent's policy. While this initial study focused on a single, simple game, the next steps of our investigations include the enhancement of the Go-Explore approach to cater for its robustification phase, the introduction of ordinal reward functions, and the extension of the approach to accommodate more complex environments within and beyond games.

REFERENCES

[1] Stacy Marsella, Jonathan Gratch, Paolo Petta, et al., "Computational models of emotion," A Blueprint for Affective Computing-A sourcebook and manual, vol. 11, no. 1, pp. 21–46, 2010.
[2] Stacy C Marsella and Jonathan Gratch, "Ema: A process model of appraisal dynamics," Cognitive Systems Research, vol. 10, no. 1, pp. 70–90, 2009.
[3] Rafael A Calvo and Sidney D'Mello, "Affect detection: An interdisciplinary review of models, methods, and their applications," IEEE Transactions on Affective Computing, vol. 1, no. 1, pp. 18–37, 2010.
[4] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune, "Montezuma's revenge solved by go-explore, a new algorithm for hard-exploration problems (sets records on pitfall, too)," Uber Engineering Blog, 2018.
[5] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune, "First return, then explore," Nature, vol. 590, no. 7847, pp. 580–586, 2021.
[6] Richard S Sutton and Andrew G Barto, Reinforcement learning: An introduction, MIT Press, 2018.
[7] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune, "Go-explore: a new approach for hard-exploration problems," arXiv preprint arXiv:1901.10995, 2019.
[8] Georgios N. Yannakakis and Julian Togelius, Artificial Intelligence and Games, Springer, 2018, http://gameaibook.org.
[9] Prithviraj Ammanabrolu, Ethan Tien, Zhaochen Luo, and Mark O. Riedl, "How to avoid being eaten by a grue: Exploration strategies for text-adventure agents," arXiv preprint arXiv:2002.08795, 2020.
[10] Andrea Madotto, Mahdi Namazifar, Joost Huizinga, Piero Molino, Adrien Ecoffet, Huaixiu Zheng, Alexandros Papangelis, Dian Yu, Chandra Khatri, and Gokhan Tur, "Exploration based language learning for text-based games," arXiv preprint arXiv:2001.08868, 2020.
[11] Guillaume Matheron, Nicolas Perrin, and Olivier Sigaud, "PBCS: Efficient exploration and exploitation using a synergy between reinforcement learning and motion planning," in Proceedings of the International Conference on Artificial Neural Networks. Springer, 2020, pp. 295–307.
[12] Mark Koren and Mykel J. Kochenderfer, "Adaptive stress testing without domain heuristics using go-explore," arXiv preprint arXiv:2004.04292, 2020.
[13] Kenneth Chang, Batu Aytemiz, and Adam M Smith, "Reveal-more: Amplifying human effort in quality assurance testing using automated exploration," in Proceedings of the IEEE Conference on Games, 2019.
[14] Konstantinos Makantasis, Antonios Liapis, and Georgios N. Yannakakis, "From pixels to affect: A study on games and player experience," in Proceedings of the International Conference on Affective Computing and Intelligent Interaction, 2019.
[15] Konstantinos Makantasis, Antonios Liapis, and Georgios N Yannakakis, "The pixels and sounds of emotion: General-purpose representations of arousal in games," IEEE Transactions on Affective Computing, 2021.
[16] Hector P Martinez, Yoshua Bengio, and Georgios N Yannakakis, "Learning deep physiological models of affect," IEEE Computational Intelligence Magazine, vol. 8, no. 2, pp. 20–33, 2013.
[17] Adria Ruiz, Ognjen Rudovic, Xavier Binefa, and M. Pantic, "Multi-instance dynamic ordinal random fields for weakly supervised facial behavior analysis," IEEE Transactions on Image Processing, vol. 27, pp. 3969–3982, 2018.
[18] R. Walecki, Ognjen Rudovic, V. Pavlovic, B. Schuller, and M. Pantic, "Deep structured learning for facial action unit intensity estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5709–5718.
[19] George Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, Mihalis A. Nicolaou, B. Schuller, and S. Zafeiriou, "Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5200–5204.
[20] Sander Koelstra, C. Mühl, M. Soleymani, Jong-Seok Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras, "Deap: A database for emotion analysis using physiological signals," IEEE Transactions on Affective Computing, vol. 3, pp. 18–31, 2012.
[21] William R Swartout, Jonathan Gratch, Randall W Hill Jr, Eduard Hovy, Stacy Marsella, Jeff Rickel, and David Traum, "Toward virtual humans," AI Magazine, vol. 27, no. 2, pp. 96–96, 2006.
[22] W Scott Reilly, "Believable social and emotional agents," Tech. Rep., Carnegie Mellon University, Pittsburgh, PA, Dept. of Computer Science, 1996.
[23] Ahmed Khalifa, Philip Bontrager, Sam Earle, and Julian Togelius, "Pcgrl: Procedural content generation via reinforcement learning," arXiv preprint arXiv:2001.09212, 2020.
[24] Georgios N Yannakakis and Julian Togelius, "Experience-driven procedural content generation," in Proceedings of the International Conference on Affective Computing and Intelligent Interaction. IEEE, 2015, pp. 519–525.
[25] Tianye Shu, Jialin Liu, and Georgios N Yannakakis, "Experience-driven PCG via reinforcement learning: A Super Mario Bros study," in Proceedings of the IEEE Conference on Games, 2021.
[26] Thomas M Moerland, Joost Broekens, and Catholijn M Jonker, “Emo-
tion in reinforcement learning agents and robots: a survey,” Machine
Learning, vol. 107, no. 2, pp. 443–480, 2018.
[27] Joost Broekens, Walter A Kosters, and Fons J Verbeek, “On affect
and self-adaptation: Potential benefits of valence-controlled action-
selection,” in International Work-Conference on the Interplay Between
Natural and Artificial Computation. Springer, 2007, pp. 357–366.
[28] Cyril Hasson, Philippe Gaussier, and Sofiane Boucenna, “Emotions
as a dynamical system: the interplay between the meta-control and
communication function of emotions,” Paladyn, vol. 2, no. 3, pp. 111–
125, 2011.
[29] Satinder Singh, Andrew G Barto, and Nuttapong Chentanez, “Intrin-
sically motivated reinforcement learning,” Tech. Rep., Massachusetts
University, Amherst Department of Computer Science, 2005.
[30] Satinder Singh, Richard L Lewis, Andrew G Barto, and Jonathan
Sorg, “Intrinsically motivated reinforcement learning: An evolutionary
perspective,” IEEE Transactions on Autonomous Mental Development,
vol. 2, no. 2, pp. 70–82, 2010.
[31] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre,
Pedro Ortega, DJ Strouse, Joel Z Leibo, and Nando De Freitas, “Social
influence as intrinsic motivation for multi-agent deep reinforcement
learning,” in Proceedings of the International Conference on Machine
Learning. PMLR, 2019, pp. 3040–3049.
[32] Léonard Hussenot, Robert Dadashi, Matthieu Geist, and Olivier Pietquin,
“Show me the way: Intrinsic motivation from demonstrations,” arXiv
preprint arXiv:2006.12917, 2020.
[33] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M
Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego
Perez, Spyridon Samothrakis, and Simon Colton, “A survey of monte
carlo tree search methods,” IEEE Transactions on Computational
Intelligence and AI in games, vol. 4, no. 1, pp. 1–43, 2012.
[34] Tim Salimans and Richard Chen, “Learning montezuma’s revenge from
a single demonstration,” arXiv preprint arXiv:1812.03381, 2018.
[35] David Melhart, Antonios Liapis, and Georgios N. Yannakakis,
“The Affect Game AnnotatIoN (AGAIN) dataset,” arXiv preprint
arXiv:2104.02643, 2021.
[36] Georgios N Yannakakis, Roddy Cowie, and Carlos Busso, “The ordinal
nature of emotions,” in Proceedings of the International Conference on
Affective Computing and Intelligent Interaction. IEEE, 2017, pp. 248–
255.
[37] Georgios N Yannakakis, Roddy Cowie, and Carlos Busso, “The ordinal
nature of emotions: An emerging approach,” IEEE Transactions on
Affective Computing, vol. 12, no. 1, pp. 16–35, 2018.