Go Blend Behavior and Affect

Abstract—This paper proposes a paradigm shift for affective computing by viewing the affect modeling task as a reinforcement learning process. According to our proposed framework, the context (environment) and the actions of an agent define the common representation that interweaves behavior and affect. To realise this framework we build on recent advances in reinforcement learning and use a modified version of the Go-Explore algorithm, which has showcased supreme performance in hard exploration tasks. In this initial study, we test our framework in an arcade game by training Go-Explore agents to both play optimally and attempt to mimic human demonstrations of arousal. We vary the degree of importance between optimal play and arousal imitation and create agents that can effectively display a palette of affect and behavioral patterns. Our Go-Explore implementation not only introduces a new paradigm for affect modeling; it empowers believable AI-based game testing by providing agents that can blend and express a multitude of behavioral and affective patterns.

Index Terms—Reinforcement Learning, Go-Explore, Arousal, Affective Computing, Artificial Agents, Gameplaying

This project has received funding from the European Union's Horizon 2020 programme under grant agreement No 951911.

I. INTRODUCTION

Affective computing is traditionally viewed from an expert-domain and supervised learning lens through which manifestations of affect are linked to ground truth labels of affect that are provided by humans. Behavior and affect are either blended in the form of hand-crafted rules [1], [2] or machine learned via supervised learning methods [3]. While affect models designed or built this way are linked to the context of the interaction, they are often completely independent of the behavior of the involved actors.

A recent (non-deep) reinforcement learning (RL) algorithm, Go-Explore [4], showcased superb performance at hard exploration problems with many states—such as complex planning-based games—that most other deep learning methods struggled with. In its application to the game Montezuma's Revenge (Parker Brothers, 1984), Go-Explore reached super-human gameplaying performance. In part, this is achieved by storing all visited game states and exploring from such interim states rather than playing the game from the start [5]. Inspired by these recent breakthroughs in RL, we leverage the capacity of Go-Explore to introduce a paradigm shift for affect modeling. We argue that viewing affect modeling as an RL process yields agents (or computational actors) that manage to reliably interweave behavior and affect without necessarily relying on affect corpora of massive sizes.

The proposed concept revolutionizes affective computing, which traditionally attempts to model human affect in the context of an interaction but largely ignores the affective response to the actions of the involved (inter-)actors. Both behavior and affect are blended in an internalised model that associates an agent's context (environment) and its actions to both its behavioral performance and its affective state. At the same time, we introduce a novel paradigm for RL where the rewards are not only tied to a user's behavior but combined with rewards from annotations provided by the users themselves (i.e. human affect demonstrations). According to our approach, both behavior and affect can form reward functions that can be experienced by RL agents that learn to behave and express affect in various ways. The proposed Go-Explore implementation is tested in a simple arcade game featuring a rich corpus of self-reported traces of arousal.

Our key findings suggest that agents can be trained effectively to behave in particular ways (e.g. play optimally with super-human performance) but also to "feel" as humans would in a particular game state. Beyond the proposed paradigm shift in affective computing, our Go-Explore agents offer insights on the relationship between affect and behavior through their RL-trained models. Importantly, RL agents that blend behavior and affect can be used directly for believable testing, as such agents can simultaneously simulate and express both behavioral and affective patterns of humans.

II. BACKGROUND

This section provides a brief overview of the related domains of reinforcement learning, the Go-Explore algorithm, traditional affect modelling via imitation learning, and affect modelling using reinforcement learning.

A. Reinforcement Learning and Go-Explore

Reinforcement learning approaches machine learning tasks from the perspective of behavioral psychology, mimicking the way animals and humans learn through receiving positive or negative rewards for their actions [6]. Exploring state spaces with sparse and/or deceptive rewards has been a core challenge for traditional RL algorithms, as they suffer from issues of detachment and derailment. Detachment occurs when an algorithm forgets how to return to previously visited promising
areas of the search space due to exploration in other areas. Derailment is a consequence of RL algorithms which do not separate returning to states from exploring the search space. This may result in potentially promising states being missed, as they require a long sequence of precise actions that is unlikely to occur under exploratory conditions.

Go-Explore is a recent algorithm in the RL family [7] which is explicitly designed to overcome the two aforementioned challenges. The algorithm was introduced with the aim of improving RL performance in hard-exploration problems, which tend to contain sparse or deceptive rewards. Go-Explore has demonstrated previously unmatched performance in Atari games [5], highlighting its ability to thoroughly explore complex and challenging environments. In games with sparse rewards (such as Montezuma's Revenge), a large number of actions must be taken before a reward can be obtained, whereas deceptive rewards may mislead the agent and result in premature convergence and therefore poor performance [8]. Go-Explore has been used for text-based games, outperforming traditional agents in Zork1 [9], and is able to generalize to unseen text-based games more effectively [10]. The algorithm's capabilities have also been demonstrated in complex maze navigation tasks which could not be completed by traditional RL agents [11]. Beyond playing planning-based games with superhuman performance, Go-Explore has been used for autonomous vehicle control for adaptive stress testing [12], and as a mixed-initiative tool for quality-assurance testing using automated exploration [13]. While Go-Explore has proved to be a highly effective algorithm for behaviour policy search, it has never been tested on affect modeling tasks. This proof-of-concept paper introduces the first application of the algorithm for modeling affect as an RL process and blending it with behavior within a game agent.

B. Reinforcement Learning and Affective Computing

Traditionally, affect modelling [3] involves constructing a computational model of affect that takes as input the context of the interaction, such as pixels [14], [15], and multimodal information about a user—including physiological signals [16], facial expressions [17], [18] or speech [19]—and outputs a predicted corresponding emotional state (i.e. the ground truth of emotion). Given that affective computing relies on a provided ground truth of emotion that is human-annotated, affect detection is naturally viewed as a supervised learning task [3]. Traditionally, a dataset of user state-affect pairs is used to train a model to predict affect [20]. Trained affect models are then used in conjunction with action selection methods for the synthesis, adaptation and affect-based expression of agents, including virtual humans [21] and social believable agents [22].

Beyond the obvious uses of RL for learning a behavior policy, RL has been used as a paradigm for creative AI and, in particular, for the procedural generation of content (PCG) [23]. While the experience-driven PCG framework [24] considered the use of affect models beyond the behavior action space, its initial version never considered RL as a training paradigm for such generators. As a response, a recent study blended the frameworks of experience-driven PCG and PCG via reinforcement learning, namely ED(PCG)RL; EDRL in short [25] focuses on the use of RL for the algorithmic creation of content according to a surrogate model of player experience or affect.

Whilst there exist a variety of studies on the topic of agent emotion and reinforcement learning, literature on using human-annotated emotion as a training signal for learning is limited [26]. It has been shown that coupling an agent's simulated affect with its action-selection mechanism allows it to find its goal faster and avoid premature convergence to local optima [27]. Similarly, [28] showed that using affect as a form of social referencing is a simple method for teaching a robot tasks, such as obstacle avoidance and object reaching. Work on intrinsic motivation through the RL paradigm [29], [30] is also highly relevant to our aims. By definition, however, intrinsic motivation studies ignore human demonstrations, both behavioral and, importantly, affective [31]. A number of very recent studies (e.g. [32]) view the intrinsic motivation paradigm from an inverse RL lens through which reward functions are inferred from behavioral demonstrations.

The work in this paper expands upon the current state of the art by viewing affect modeling as an RL paradigm and explicitly blending agent behavior and affect using a cutting-edge RL algorithm for hard exploration problems. The result is a set of agents which are tested in games in this initial study. The game agents are trained to behave (i.e. play) optimally, even better than humans, to "feel" like a human would (via arousal imitation), or to blend the two approaches with varying degrees of importance.

III. BLENDING BEHAVIOUR AND AFFECT

This paper proposes combining rewards for good behavioral performance with rewards for affect matching in a reinforcement learning agent. We leverage the Go-Explore RL algorithm and describe our implementation in Section III-A, and how it is enriched with affect information in Sections III-B and III-C.

A. Go-Explore Implementation

The Go-Explore algorithm builds on two phases to create a robust search policy that performs well under a specified reward scheme received from the environment. The first phase is the exploration phase, where a deterministic model of the environment is used to explore the search space thoroughly. During exploration, an archive of the states encountered so far is used to ensure states are not forgotten, thus preventing the issue of detachment. Each state in the archive also contains the string of actions needed to return to it, addressing the issue of derailment and ensuring that all states can be visited. States are chosen using a selection strategy (e.g. randomly or through the UCB formula [33]), after which the algorithm returns to the state as described and begins exploring from there. At its simplest, exploration occurs by taking random actions and updating the cell archive with new states or updating existing
ones with better reward values. The move selection strategy can be improved according to the nature of the environment being searched and through the use of expert knowledge.

The result of the exploration phase is a number of high-performing trajectories using the deterministic model. If required, the robustification phase uses the "backward algorithm" [34] to train an agent to perform at the same level (or better) as the trajectories found in exploration, but in a stochastic setting. The backward algorithm is an RL technique used to learn from a given trajectory by decomposing the problem into smaller exploration tasks. It starts by placing the agent near the end of the trajectory and uses an off-the-shelf RL algorithm to train the agent to imitate its last segment. This is repeated several times, moving the starting point further back until the beginning of the trajectory is reached and the agent has been trained on the entire trajectory. To stabilize learning, Go-Explore extends this method to use multiple trajectories which are uniformly sampled at the beginning of each learning episode.
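To make the backward algorithm concrete, the sketch below illustrates the loop described above under a number of assumptions: a deterministic emulator that can replay a prefix of a demonstration, and placeholder callables (replay_prefix, run_episode, update_policy) standing in for an off-the-shelf RL learner. These names and the step size are illustrative, not the interface used in [34].

```python
# Hedged sketch of the "backward algorithm": train from the end of a demonstration,
# moving the starting point back once the policy matches the demonstration's score.
from typing import Callable, Sequence

def backward_robustify(
    demo_actions: Sequence[int],                     # demonstration trajectory, start -> end
    demo_score: float,                               # score achieved by the demonstration
    replay_prefix: Callable[[Sequence[int]], None],  # reset env and replay these actions
    run_episode: Callable[[], float],                # let the policy play on, return its score
    update_policy: Callable[[float], None],          # one RL update from the episode outcome
    step_back: int = 5,
    episodes_per_start: int = 100,
) -> None:
    """Train a stochastic policy on the demo, starting near its end and moving backwards."""
    start = max(len(demo_actions) - step_back, 0)
    while True:
        for _ in range(episodes_per_start):
            replay_prefix(demo_actions[:start])      # deterministically return to the start point
            score = run_episode()                    # the policy completes the remaining segment
            update_policy(score)
            if score >= demo_score:                  # segment mastered: move the start point back
                break
        if start == 0:                               # the whole trajectory has been covered
            break
        start = max(start - step_back, 0)
```

In the full Go-Explore robustification phase, several demonstration trajectories would be sampled uniformly at the start of each episode rather than a single one, as noted above.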
Our implementation of Go-Explore follows the original approach by Ecoffet et al. [5] (see Fig. 1). An archive of cells stores the game states that have been visited, with each cell representing a unique game state and containing the instructions needed to reach that point in the game. Each cell has an associated reward value, which is used to determine if the cell should be updated in case a similar state with a better score is found. Cells are chosen to explore from randomly, and the actions taken to build trajectories during exploration are also random. Along with the action trajectory to return to its state, each cell also contains trajectories for the state with accompanying cumulative behavioral and affect rewards per trajectory. This implementation of Go-Explore differs from the original version through the inclusion of affect (i.e. arousal in this study) in the reward function. Moreover, in this paper the robustification phase of Go-Explore is not carried out but will be explored in future work.

Fig. 1. A high-level overview of Go-Explore that blends agent behavior and affect.
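The following is a minimal sketch of the exploration phase just described: a cell archive keyed by a discretised game state, random cell selection, random actions, and cells updated when a trajectory reaching the same cell obtains a better blended reward (the Rλ of Eq. (1), defined in Section III-C). The environment interface (reset, step, the cell key it returns) is a hypothetical stand-in for the deterministic game simulation, not the authors' code.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, Hashable, List

@dataclass
class Cell:
    trajectory: List[int] = field(default_factory=list)  # actions that return to this state
    r_behavior: float = 0.0   # cumulative (normalized) behavior reward R_b
    r_affect: float = 0.0     # cumulative (normalized) arousal reward R_a
    r_blend: float = 0.0      # R_lambda = lam * R_a + (1 - lam) * R_b

def explore(env, lam: float, iterations: int = 4000, max_actions: int = 20,
            n_actions: int = 6) -> Dict[Hashable, Cell]:
    """Exploration phase: return an archive mapping cell keys to their best trajectories."""
    archive: Dict[Hashable, Cell] = {env.reset(): Cell()}
    for _ in range(iterations):
        key = random.choice(list(archive.keys()))       # "go": pick an archived cell at random
        env.reset()
        for action in archive[key].trajectory:          # deterministically return to it
            env.step(action)
        trajectory = list(archive[key].trajectory)
        for _ in range(max_actions):                    # "explore": take random actions
            action = random.randrange(n_actions)
            r_b, r_a, new_key = env.step(action)        # cumulative rewards + new cell key
            trajectory.append(action)
            r_blend = lam * r_a + (1 - lam) * r_b       # blended reward, Eq. (1)
            best = archive.get(new_key)
            if best is None or r_blend > best.r_blend:  # keep the best trajectory per cell
                archive[new_key] = Cell(list(trajectory), r_b, r_a, r_blend)
    return archive
```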
B. Arousal Model

A natural question arising when one is asked to blend behavior and affect within a learning process is how the two pieces of information will be considered and fused. An obvious requirement is that the human annotations of affect are time-continuous, thus providing moment-to-moment information about the change of affective states and aligning them with game states stored in playtraces.

One approach for calculating an affect reward would be to build a priori models of affect using supervised learning and use their predicted outcomes indirectly as affect-based reward functions. Instead, one could use the affect labels directly and build reward functions based on this information. Rather than relying on a trained surrogate model of arousal in a given state, our algorithm queries a dataset of human arousal demonstrations to find the arousal value of the human player closest to the current game state. We use the playtraces and their associated arousal traces directly to assess the player's arousal value in that state which, in turn, provides the intended arousal goal at this point in time.
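A sketch of this lookup is given below: the agent's arousal in a given state is taken from the closest game state found in the corpus of human playtraces and their time-continuous arousal annotations. The dataset layout (a list of state-arousal pairs) and the distance metric are illustrative assumptions, not the exact AGAIN data format.

```python
from typing import List, Sequence, Tuple

def closest_human_arousal(
    agent_state: Sequence[float],
    human_observations: List[Tuple[Sequence[float], float]],  # (game state, annotated arousal)
) -> float:
    """Return the arousal annotated for the human game state nearest to the agent's state."""
    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean distance (assumed)
    _, arousal = min(human_observations, key=lambda obs: distance(obs[0], agent_state))
    return arousal
```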
C. Reward Function

The reward function used for this version of Go-Explore consists of two weighted functions for optimizing behavior and imitating human affect, respectively. Both components are normalized within the range [0, 1] to avoid uneven weighting between the two objectives. In particular, the reward function used, Rλ, is as follows:

Rλ = λ · Ra + (1 − λ) · Rb    (1)

where Ra and Rb are the rewards associated with affect and behavior, respectively, and λ is a weighting parameter that blends the two rewards. Formally, the reward associated to affect (i.e. arousal in this paper) is computed as follows:

Ra = (1/n) Σ_{i=0}^{n} (1 − |h(i) − a(i)|)    (2)

where i is a playtrace and affect annotation observation within a time window; n is the number of observations made so far in this trajectory; h(i) is the agent's estimated arousal value in its current game state; and a(i) is the arousal goal at this point in the game. In this paper, we derive h(i) and a(i) directly from human playtraces and their accompanying affect annotations (see Section III-B). Specifically, we calculate a(i) by first creating a mean arousal trace, averaging all players' arousal values at the same timestamp: this creates a moment-to-moment arousal trace that captures the consensus of players (regardless of actual game context). a(i) is then calculated by finding the arousal value of this mean arousal trace for that time window i. On the other hand, h(i) is based on the agent's current game state, finding the annotated arousal value of a human playtrace at any timestamp which has an accompanying game state closest to the agent's game state.

Maximizing Ra amounts to minimizing the absolute difference between the arousal value of a human player in a game state similar to the agent's, and the mean annotated arousal value at time window i. Since this difference is averaged across the number of observations made so far, it encourages trajectories with high imitation accuracy across the whole arousal trace generated.

The reward for the agent's behavior (Rb) depends on the game; in this paper we assume that the total score accumulated throughout the game is a sufficient reward for optimal behavior. This assumes that the environment follows arcade game tropes where the game is played for a high score, as is the case in our case study described in Section IV. In more complex games, or in games without an explicit score, the reward signal must be designed on an ad-hoc basis, such as the reward function used in the original implementation of Go-Explore [5].

According to Eq. (1), if λ = 0 the reward function trains the agent to only maximise its score (i.e. optimize its behavior) and ignore its associated arousal trace. On the other hand, if λ = 1 the agent is trained to imitate human arousal and ignore its behavior.
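The two equations above reduce to a few lines of arithmetic. The sketch below assumes per-window arousal values already normalized to [0, 1]; h_values holds the agent's estimated arousal per time window (looked up from the closest human state, as in Section III-B) and a_values holds the goal taken from the mean human arousal trace. The variable names are illustrative.

```python
from typing import Sequence

def arousal_reward(h_values: Sequence[float], a_values: Sequence[float]) -> float:
    """Eq. (2): mean arousal-matching reward over the observations made so far."""
    n = len(h_values)
    if n == 0:
        return 0.0
    return sum(1.0 - abs(h - a) for h, a in zip(h_values, a_values)) / n

def blended_reward(r_affect: float, r_behavior: float, lam: float) -> float:
    """Eq. (1): R_lambda = lambda * R_a + (1 - lambda) * R_b, with both terms in [0, 1]."""
    return lam * r_affect + (1.0 - lam) * r_behavior

# Example: an agent whose arousal closely tracks the goal but whose score is low.
r_a = arousal_reward([0.6, 0.7, 0.8], [0.5, 0.7, 0.9])  # = (0.9 + 1.0 + 0.9) / 3 ~ 0.93
r_b = 0.25
print(blended_reward(r_a, r_b, lam=0.5))                 # ~ 0.59
```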
IV. CASE STUDY: ENDLESS RUNNER

The proposed vision of blending arousal and performance rewards is tested in the "Endless Runner" game (hereafter Endless). Endless is a platformer game built using the Unity Engine and featured in the AGAIN dataset [35]. The game was chosen for its simple mechanics and objective, and for its accompanying dataset of 112 annotated human play sessions that can be easily used for the arousal model.

A. Game Description

In Endless, the player controls an avatar that constantly moves towards the right and must avoid or destroy obstacles that spawn in their path. The platform consists of two lanes (top/bottom) and the player's only controls are switching lanes by moving up or down (via keyboard input) and/or using the melee attack described below. Game objects are placed on one of the two lanes (upper/lower) and are spawned at random intervals. Game objects include items that the player may collide with to improve their score (coins) or alter their movement speed (potions). Other game objects are obstacles (which include immobile enemies); the player must use their melee attack when in close proximity to an obstacle in order to clear it. Colliding with an obstacle results in a 10-point score penalty, destroys any nearby game objects on the screen, and resets the player's speed to the default value. Every 3 seconds the player is passively awarded a point to their game score, on top of any bonus points they may receive for collecting coins. Every 10 seconds the speed of the player increases by a fixed amount, increasing the difficulty of the game. In theory, the game can be played for as long as the player wants. During data collection for the AGAIN dataset, an Endless session ended after exactly two minutes and the player had infinite lives. We follow the same duration in all experiments in this paper in order to leverage players' affect annotations and compare the agents' performance with human play.

Fig. 2. Endless Runner Game Layout.

B. Go-Explore for Endless Runner

The game was converted into a deterministic environment to be compatible with the exploration phase of Go-Explore. The sequence of objects to be spawned and their spawn times was fixed to ensure the same sequence of game states is observed when replaying trajectories. Moreover, the game could start from any saved snapshot (i.e. any visited game state). This minimizes the time spent returning to a new cell's state and allows the algorithm to focus on exploration, an approach central to the Go-Explore paradigm [5]. To decide which cell a game state should be assigned to, the game state is mapped to an 8-parameter vector describing the player's current lane (two binary values, one per lane) and which game objects are on each lane at specific distance bands (near, mid-distance, and far). The possible values for these bands are empty, item, or obstacle; if items and obstacles exist in the same band, it is treated as an obstacle band.
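As an illustration of this 8-parameter cell representation, the sketch below builds a key from two binary lane indicators plus three band values per lane, with obstacles dominating items in a shared band. The helper names and the numeric band thresholds are assumptions for the example, not values reported in the paper.

```python
from enum import IntEnum
from typing import List, Tuple

class Band(IntEnum):
    EMPTY = 0
    ITEM = 1
    OBSTACLE = 2

def band_contents(objects: List[Tuple[str, float]], lo: float, hi: float) -> Band:
    """Summarise one distance band of a lane; obstacles dominate items in the same band."""
    kinds = {kind for kind, dist in objects if lo <= dist < hi}
    if "obstacle" in kinds:
        return Band.OBSTACLE
    if "item" in kinds:
        return Band.ITEM
    return Band.EMPTY

def cell_key(player_lane: int,
             lane_objects: List[List[Tuple[str, float]]],    # per lane: (kind, distance ahead)
             bands=((0.0, 5.0), (5.0, 15.0), (15.0, 30.0))   # near / mid / far (hypothetical)
             ) -> Tuple[int, ...]:
    """Map a game state to the 8-parameter vector used to index the cell archive."""
    key = [int(player_lane == 0), int(player_lane == 1)]     # two binary lane indicators
    for objects in lane_objects:                              # two lanes
        for lo, hi in bands:                                  # three bands each
            key.append(int(band_contents(objects, lo, hi)))
    return tuple(key)
```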
The reward for optimal behavior (Rb) in Endless is the player's total score after an action is taken. This value is normalized between 0 and 1 with respect to the optimal score achievable in the play session. The optimal in-game score is calculated by summing two components. The first is the total amount of points awarded to the player passively over time for not dying during the game. The second is the maximum amount of bonus points achievable by picking up every coin in the deterministic environment.
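A sketch of this normalization is shown below. Passive points follow the game description (one point every 3 seconds); the per-coin bonus value is an assumption made only for the example.

```python
def optimal_score(session_seconds: float, total_coins: int,
                  coin_bonus: int = 1, passive_interval: float = 3.0) -> float:
    """Best score attainable in the deterministic session: survival points plus all coins."""
    passive_points = session_seconds // passive_interval  # 1 point every 3 seconds survived
    return passive_points + total_coins * coin_bonus      # coin_bonus is a hypothetical value

def behavior_reward(current_score: float, session_seconds: float, total_coins: int) -> float:
    """R_b in [0, 1]: the agent's score relative to the optimal score for the session."""
    best = optimal_score(session_seconds, total_coins)
    return max(0.0, min(1.0, current_score / best)) if best > 0 else 0.0

# Example: a 120-second session with 25 coins has an optimal score of 40 + 25 = 65.
print(behavior_reward(current_score=52, session_seconds=120, total_coins=25))  # = 0.8
```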
C. Experimental Protocol

Reported results per method are averaged across five independent runs of the Go-Explore algorithm. Each run consists of the exploration phase of Go-Explore (there is no robustification phase in this first experiment), and the agent returns and explores 4,000 times before selecting the best trajectory and saving it. The agent explores a maximum of 20 actions before choosing a new state to explore from. The actions taken during exploration are chosen at random among the 6 possible options (move up, move down, move up and attack, move down and attack, attack only, or no action). The new state to explore from is chosen at random among those already discovered: neither the reward of the state in the archive nor the number of times it has been visited is considered. The best trajectories are saved and can be used for the robustification phase of Go-Explore in future work.

The λ parameter of Eq. (1) was varied to observe the relationship between learning to play the game optimally and learning to imitate human-annotated arousal. Table I shows the five values used for the λ parameter, ranging from 0 to 1 in increments of 0.25. Recall that at λ = 0 and λ = 1 the agent tries to learn to solely behave optimally or to solely "feel" like a human, respectively. As a baseline, an experiment with an agent that performed random actions was carried out, with results averaged from 5 independent runs. To estimate this random agent's arousal levels, a trace was generated based on the game states visited, using the same approach as in the Go-Explore experiments.

The results were compared to the average performance seen by humans in the dataset for both the behavior and arousal reward functions. All results given are the average observed across the 5 runs of Go-Explore, paired with the 95% confidence interval.
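The protocol above can be summarised as the sketch below, which reuses the exploration-phase sketch from Section III-A. The environment constructor and the result aggregation are illustrative assumptions; the paper reports means with 95% confidence intervals, whereas only means are computed here for brevity.

```python
import statistics

def run_protocol(make_env, explore, runs: int = 5, iterations: int = 4000,
                 max_actions: int = 20):
    """Sweep lambda from 0 to 1 in steps of 0.25, with several exploration runs per value."""
    results = {}
    for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
        best_rewards = []
        for _ in range(runs):
            archive = explore(make_env(), lam, iterations, max_actions)
            best = max(archive.values(), key=lambda cell: cell.r_blend)
            best_rewards.append((best.r_behavior, best.r_affect, best.r_blend))
        # Average R_b, R_a and R_lambda of the best trajectory across the independent runs.
        results[lam] = tuple(statistics.mean(vals) for vals in zip(*best_rewards))
    return results
```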
D. Results

TABLE I
Results for Endless averaged from 5 runs and including the 95% confidence intervals.

Experiment Setup    Rb                Ra                Rλ
R0.0                0.79 (±0.0474)    0.72 (±0.0126)    0.79 (±0.0474)
R0.25               0.73 (±0.0818)    0.73 (±0.0181)    0.73 (±0.0569)
R0.5                0.74 (±0.0741)    0.74 (±0.0145)    0.74 (±0.0311)
R0.75               0.69 (±0.0658)    0.76 (±0.0147)    0.74 (±0.0082)
R1.0                0.25 (±0.1335)    0.79 (±0.0056)    0.79 (±0.0056)
Random              0.03 (±0.1012)    0.75 (±0.0074)    N/A
Human               0.70 (±0.0467)    0.77 (±0.0131)    N/A

Table I shows the final values observed for the cumulative behavior (Rb) and arousal (Ra) components, as well as the overall reward function (Rλ), for each experiment. Note that the baseline agent and human entries are not included in the Rλ column as they were not trained using Go-Explore. Figure 3c illustrates how the agents' overall cumulative reward fluctuates over time for each Go-Explore configuration. Note that, due to the different λ values, the Rλ values across experiments are not comparable, but the differences in how they fluctuate over time provide insight into the behavior of the algorithm. It is clear that agents with higher priority assigned to arousal imitation tend to converge to their maximum value more quickly due to the nature of the arousal reward function. Since at R0.0 the total reward amounts to a normalized measure of the agent's in-game score, it is not surprising that high scores are only attainable at late points in the game. Instead, states that match the mean arousal trace seem to be easily discovered even early in the game.

Looking at the results for the behavioral component (i.e. the total game score normalized to the absolute best possible score), the random agent shows the worst performance, as one would expect when playing most games. While the exploration phase of Go-Explore relies on a random sequence of actions, the discovery of interim states (cells) to explore from and the optimization of these states based on Rλ clearly lead to a more efficient playstyle than random. For R1.0, the agent still manages to produce a better score than the random agent but remains significantly lower than the average human player. Random and R1.0 also display a wider confidence interval compared to the rest of the experiments, pointing to inconsistent behavior. When the behavior component is introduced with a small weight (e.g. R0.75), the score immediately matches that of the average human demonstration. As λ is lowered to zero, the agent's score improves and surpasses human levels of performance. Figure 3a illustrates how the agents' cumulative behavior reward changes over time for each configuration. As noted above, the cumulative behavior reward is very time-dependent by design (players reach higher scores the longer they play), but clearly the random agent (and R1.0 to a degree) tends to lose score by hitting obstacles, which seems to perfectly offset passive score gains.

The results for the arousal component tell a similar story to the results for behavior, with the exception of the random agent. Unsurprisingly, the arousal score increases as λ increases from 0 to 1. What is surprising, however, is the arousal score attained by the random agent, which seems to be almost at the same level as the human trace and is only significantly surpassed by R1.0. The potential reasons for this are discussed in Section V. Figure 3b illustrates how the agents' cumulative arousal reward changes over time for each configuration. It is evident that, unlike Rb which is tied to the game score, it is easy to attain high values in Ra early on, and it is also easy to maintain the same levels throughout the game even when performing random actions.
V. DISCUSSION
approach for deriving an arousal value for a given state with this limitation in mind would help generate more diverse traces and allow the differences in the reward functions to become more pronounced. A more complex game where the agent has more degrees of freedom and more arousing stimuli for the human playtesters will also likely illuminate the strengths and weaknesses of this approach.

This proof of concept opens up several avenues for future work to further explore the relationship between behavior and affect in the context of reinforcement learning. Obvious next steps have been highlighted above in terms of refining the arousal reward function and testing the approach in more complex, more stimulating games. Another direction is testing machine-learned predictors of affective states rather than the direct mapping to the closest human trace performed currently. While surrogate models are often inexact, this may counteract the sparse game states encountered by human players when matching an unseen state. More importantly, incorporating the robustification phase in the Go-Explore algorithm is expected to lead to new insights on the impact of affect-based rewards, especially since the environment will no longer be deterministic and thus many more game states are likely to be visited. Finally, imitating human behavior (as a form of reward function) can reveal interesting new relationships between human-like behavior, affect, and optimal play; such derived policies would likely allow agents to play (near) optimally, whilst attempting to imitate both human behavior and human affect.

VI. CONCLUSION

This paper presents a proof-of-concept implementation of a new reinforcement learning paradigm for affective computing where behavioral and affective goals are interwoven. We leverage the Go-Explore algorithm due to its cutting-edge ability to solve hard exploration problems, and we pair it with a set of reward functions that blend optimal behavior with arousal imitation to different degrees. Using the Endless Runner game as a platform to test the implementation, we were able to make use of an extensive dataset of human play sessions and accompanying arousal demonstrations that guided the agent's policy. While this initial study focused on a single, simple game, the next steps of our investigations include the enhancement of the Go-Explore approach to cater for its robustification phase, the introduction of ordinal reward functions, and the extension of the approach to accommodate more complex environments within and beyond games.

REFERENCES

[1] Stacy Marsella, Jonathan Gratch, Paolo Petta, et al., "Computational models of emotion," A Blueprint for Affective Computing-A sourcebook and manual, vol. 11, no. 1, pp. 21–46, 2010.
[2] Stacy C Marsella and Jonathan Gratch, "Ema: A process model of appraisal dynamics," Cognitive Systems Research, vol. 10, no. 1, pp. 70–90, 2009.
[3] Rafael A Calvo and Sidney D'Mello, "Affect detection: An interdisciplinary review of models, methods, and their applications," IEEE Transactions on Affective Computing, vol. 1, no. 1, pp. 18–37, 2010.
[4] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune, "Montezuma's revenge solved by go-explore, a new algorithm for hard-exploration problems (sets records on pitfall, too)," Uber Engineering Blog, 2018.
[5] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune, "First return, then explore," Nature, vol. 590, no. 7847, pp. 580–586, 2021.
[6] Richard S Sutton and Andrew G Barto, Reinforcement learning: An introduction, MIT Press, 2018.
[7] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune, "Go-explore: a new approach for hard-exploration problems," arXiv preprint arXiv:1901.10995, 2019.
[8] Georgios N. Yannakakis and Julian Togelius, Artificial Intelligence and Games, Springer, 2018, http://gameaibook.org.
[9] Prithviraj Ammanabrolu, Ethan Tien, Zhaochen Luo, and Mark O. Riedl, "How to avoid being eaten by a grue: Exploration strategies for text-adventure agents," arXiv preprint arXiv:2002.08795, 2020.
[10] Andrea Madotto, Mahdi Namazifar, Joost Huizinga, Piero Molino, Adrien Ecoffet, Huaixiu Zheng, Alexandros Papangelis, Dian Yu, Chandra Khatri, and Gokhan Tur, "Exploration based language learning for text-based games," arXiv preprint arXiv:2001.08868, 2020.
[11] Guillaume Matheron, Nicolas Perrin, and Olivier Sigaud, "PBCS: Efficient exploration and exploitation using a synergy between reinforcement learning and motion planning," in Proceedings of the International Conference on Artificial Neural Networks. Springer, 2020, pp. 295–307.
[12] Mark Koren and Mykel J. Kochenderfer, "Adaptive stress testing without domain heuristics using go-explore," arXiv preprint arXiv:2004.04292, 2020.
[13] Kenneth Chang, Batu Aytemiz, and Adam M Smith, "Reveal-more: Amplifying human effort in quality assurance testing using automated exploration," in Proceedings of the IEEE Conference on Games, 2019.
[14] Konstantinos Makantasis, Antonios Liapis, and Georgios N. Yannakakis, "From pixels to affect: A study on games and player experience," in Proceedings of the International Conference on Affective Computing and Intelligent Interaction, 2019.
[15] Konstantinos Makantasis, Antonios Liapis, and Georgios N Yannakakis, "The pixels and sounds of emotion: General-purpose representations of arousal in games," IEEE Transactions on Affective Computing, 2021.
[16] Hector P Martinez, Yoshua Bengio, and Georgios N Yannakakis, "Learning deep physiological models of affect," IEEE Computational Intelligence Magazine, vol. 8, no. 2, pp. 20–33, 2013.
[17] Adria Ruiz, Ognjen Rudovic, Xavier Binefa, and M. Pantic, "Multi-instance dynamic ordinal random fields for weakly supervised facial behavior analysis," IEEE Transactions on Image Processing, vol. 27, pp. 3969–3982, 2018.
[18] R. Walecki, Ognjen Rudovic, V. Pavlovic, B. Schuller, and M. Pantic, "Deep structured learning for facial action unit intensity estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5709–5718.
[19] George Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, Mihalis A. Nicolaou, B. Schuller, and S. Zafeiriou, "Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5200–5204.
[20] Sander Koelstra, C. Mühl, M. Soleymani, Jong-Seok Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras, "Deap: A database for emotion analysis using physiological signals," IEEE Transactions on Affective Computing, vol. 3, pp. 18–31, 2012.
[21] William R Swartout, Jonathan Gratch, Randall W Hill Jr, Eduard Hovy, Stacy Marsella, Jeff Rickel, and David Traum, "Toward virtual humans," AI Magazine, vol. 27, no. 2, pp. 96–96, 2006.
[22] W Scott Reilly, "Believable social and emotional agents," Tech. Rep., Carnegie Mellon University, Pittsburgh, PA, Dept. of Computer Science, 1996.
[23] Ahmed Khalifa, Philip Bontrager, Sam Earle, and Julian Togelius, "Pcgrl: Procedural content generation via reinforcement learning," arXiv preprint arXiv:2001.09212, 2020.
[24] Georgios N Yannakakis and Julian Togelius, "Experience-driven procedural content generation," in Proceedings of the International Conference on Affective Computing and Intelligent Interaction. IEEE, 2015, pp. 519–525.
[25] Tianye Shu, Jialin Liu, and Georgios N Yannakakis, "Experience-driven PCG via reinforcement learning: A Super Mario Bros study," in Proceedings of the IEEE Conference on Games, 2021.
[26] Thomas M Moerland, Joost Broekens, and Catholijn M Jonker, “Emo-
tion in reinforcement learning agents and robots: a survey,” Machine
Learning, vol. 107, no. 2, pp. 443–480, 2018.
[27] Joost Broekens, Walter A Kosters, and Fons J Verbeek, “On affect
and self-adaptation: Potential benefits of valence-controlled action-
selection,” in International Work-Conference on the Interplay Between
Natural and Artificial Computation. Springer, 2007, pp. 357–366.
[28] Cyril Hasson, Philippe Gaussier, and Sofiane Boucenna, “Emotions
as a dynamical system: the interplay between the meta-control and
communication function of emotions,” Paladyn, vol. 2, no. 3, pp. 111–
125, 2011.
[29] Satinder Singh, Andrew G Barto, and Nuttapong Chentanez, “Intrin-
sically motivated reinforcement learning,” Tech. Rep., Massachusetts
University, Amherst Department of Computer Science, 2005.
[30] Satinder Singh, Richard L Lewis, Andrew G Barto, and Jonathan
Sorg, “Intrinsically motivated reinforcement learning: An evolutionary
perspective,” IEEE Transactions on Autonomous Mental Development,
vol. 2, no. 2, pp. 70–82, 2010.
[31] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre,
Pedro Ortega, DJ Strouse, Joel Z Leibo, and Nando De Freitas, “Social
influence as intrinsic motivation for multi-agent deep reinforcement
learning,” in Proceedings of the International Conference on Machine
Learning. PMLR, 2019, pp. 3040–3049.
[32] Léonard Hussenot, Robert Dadashi, Matthieu Geist, and Olivier Pietquin,
“Show me the way: Intrinsic motivation from demonstrations,” arXiv
preprint arXiv:2006.12917, 2020.
[33] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M
Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego
Perez, Spyridon Samothrakis, and Simon Colton, “A survey of monte
carlo tree search methods,” IEEE Transactions on Computational
Intelligence and AI in games, vol. 4, no. 1, pp. 1–43, 2012.
[34] Tim Salimans and Richard Chen, “Learning montezuma’s revenge from
a single demonstration,” arXiv preprint arXiv:1812.03381, 2018.
[35] David Melhart, Antonios Liapis, and Georgios N. Yannakakis,
“The Affect Game AnnotatIoN (AGAIN) dataset,” arXiv preprint
arXiv:2104.02643, 2021.
[36] Georgios N Yannakakis, Roddy Cowie, and Carlos Busso, “The ordinal
nature of emotions,” in Proceedings of the International Conference on
Affective Computing and Intelligent Interaction. IEEE, 2017, pp. 248–
255.
[37] Georgios N Yannakakis, Roddy Cowie, and Carlos Busso, “The ordinal
nature of emotions: An emerging approach,” IEEE Transactions on
Affective Computing, vol. 12, no. 1, pp. 16–35, 2018.