Active Inference and Cognitive Control
Riccardo Proietti1, Thomas Parr2, Alessia Tessari3,4, Karl Friston5,6, Giovanni Pezzulo,1,*
1. Institute of Cognitive Sciences and Technologies, National Research Council, Rome, Italy
2. Nuffield Department of Clinical Neurosciences, University of Oxford, UK
3. Department of Psychology, University of Bologna, Italy
4. Alma Mater Research Institute for Human-Centered Artificial Intelligence, University of
Bologna, Bologna, Italy
5. Wellcome Centre for Human Neuroimaging, Queen Square Institute of Neurology, University
College London, London, UK
6. VERSES AI Research Lab, Los Angeles, CA 90016, USA
* Corresponding author:
Giovanni Pezzulo
ISTC-CNR
Email: [email protected]
Abstract
We advance a novel formulation of cognitive control within the active inference framework. The theory
proposes that cognitive control amounts to optimising a precision parameter, which acts as a control
signal and balances the contributions of deliberative and habitual components of action selection. To
illustrate the theory, we simulate a driving scenario in which the driver follows a well-known route, but
encounters unexpected challenges. Our simulations show that a standard active inference model can
form adaptive habits; i.e., can pass from deliberative to habitual control when the context is stable, but
generally fails to revert to deliberative control when the context changes. To address this failure of
context-sensitivity, we introduce a novel type of hierarchical active inference, in which a lower level is
responsible for behavioural control and the higher (or meta-cognitive) level observes the belief updating
of the lower level and is responsible for cognitive control. Crucially, the meta-cognitive level
can both form habits and suspend them, by controlling the (precision) parameter that prioritizes
deliberative choices at the behavioural level. Furthermore, we show that several processes linked to
cognitive control — such as surprise detection, cognitive conflict monitoring, control signal regulation
and specification, the simulation of future outcomes and the assessment of the costs of control and
mental effort — stem coherently from the free energy minimization scheme that underpins active
inference. Finally, we discuss the putative neurobiology of cognitive control by simulating brain
dynamics in the mesolimbic and mesocortical pathways of the dopamine system, the dorsal anterior
cingulate cortex and the locus coeruleus.
Keywords: cognitive control; attention to action; active inference; mental effort; dopamine.
Introduction
It is common wisdom that “practice makes perfect”. From a cognitive perspective, practice also entails
a gradual passage from an effortful or controlled mode — during the performance of novel or difficult
tasks — to more automatic information processing during the performance of familiar and simpler tasks
(Anderson, 1982; Shiffrin and Schneider, 1977). For example, while a novice driver must devote
significant cognitive resources to each aspect of driving, an experienced driver can often drive in an
effortless fashion, paying little “attention to action”.
A common explanation of this phenomenon is that the control of skilled and novel actions is associated
with two fundamentally different types of brain processes or controllers, which are called automatic
versus controlled (or intentional), habitual versus deliberative, procedural versus goal-directed
processes, or system 1 and system 2 (Balleine and Dickinson, 1998; Daw et al., 2005; Kahneman, 2011;
Norman and Shallice, 1986; Stanovich and West, 2000). The same dichotomy between distinct action
selection mechanisms, or controllers, recurs across lower-level motor control processes and higher-level decision-making processes (Milli et al., 2021). Hence, to generalize, below we use the word actions to
refer to the products of these two controllers, irrespective of whether they are movements, decisions or
mental actions.
In habitual or procedural control, the task is initiated and executed without deliberative attention and is
performed automatically, based on associative learning, without calling on limited processing resources
and without awareness (Kahneman et al., 1983; Posner, 1978; Shiffrin and Schneider, 1977). This type
of control is engaged for relatively simple or well-learned cognitive and motor tasks. Two hallmarks of
habitual-procedural control are the fact that action initiation can be directly triggered by environmental
cues and that action execution can engage a preconfigured behavioural plan, such as a sequence of
motor acts (Anderson, 1982; Taatgen and Lee, 2003) or action chunks (Dezfouli and Balleine, 2012;
Rumiati and Tessari, 2002; Tessari et al., 2021, 2006). The combination of these two factors ensures
that tasks are generally executed faster (because actions are recalled automatically) and with little
expenditure of cognitive resources (because the agent only needs to monitor the final outcome of the
preconfigured plan, not each constituent motor act). This corresponds to the usual definition of a habit
as a skilled action, which can be engaged with minimal processing resources (Miller et al., 2018).
However, these advantages come at the expense of flexibility: habitual control is only appropriate when
the situation is predictable but can fail in novel or unforeseen circumstances (Moors and De Houwer,
2006; Schneider and Chein, 2003).
On the other hand, deliberative or goal-directed control involves the formation of novel action plans,
the online monitoring of their outcomes and the capability to counteract maladaptive habitual responses
and predispositions. These are all considered cognitively demanding tasks, which is why deliberative
control is associated with mental effort and the engagement of attentional resources (and sometimes
conscious processing). Contrary to habitual control, deliberative control is flexible and permits dealing
more effectively with complex and unforeseen circumstances, novel contingencies and volatilities
(Balleine and Dickinson, 1998).
Furthermore, there is growing appreciation of the fact that the arbitration or combination of habitual-
procedural and controlled-deliberative processes rests on a cost-benefit computation that decides
whether and to what extent to engage costly (controlled-deliberative) processes, by balancing pragmatic
benefits of engaging such processes and their associated cognitive effort (Daw et al., 2011; Dolan and
Dayan, 2013; Maisto et al., 2019; Pezzulo et al., 2013). An early attempt to characterize this cost-benefit
computation was provided by the model of attention to action of Norman and Shallice (1986). Accordingly, there are two processes that operate complementarily to select and control action (although their joint operation could be disrupted under certain conditions). First, a contention scheduling mechanism selects amongst the possible action schemas; in modern parlance, this is a policy selection mechanism that selects among policies or action sequences (Friston et al., 2017;
McClelland & Rumelhart, 1981; Parr et al., 2022; Rumelhart & Norman, 1982; Sutton & Barto, 1998).
Second, a supervisory attentional system provides control over the selection of action schemas, by exerting some additional activation (or inhibition) on an action schema, to bias its selection in the contention scheduling mechanism. This second mechanism therefore provides some cognitive control
by deploying attention, which requires cognitive effort, especially if the schema to be selected is
unfamiliar (Cooper and Shallice, 2000; Shallice and Burgess, 1993).
A more recent, neurobiologically grounded theory of cognitive control and the deployment of cognitive
effort is the expected value of control (EVC) theory of Shenhav et al. (2013). This theory proposes that the
allocation of control is based on a cost-benefit evaluation of the payoff that one obtains by engaging a
controlled process and the cost (cognitive effort) required to engage enough control to achieve the
payoff. The theory identifies three key processes of cognitive control. Firstly, the regulation process
describes the capacity of a control mechanism to influence lower-level information processing
mechanisms. Regulation is achieved by a control signal that changes the parameters and functioning of
lower-level mechanisms, and which has two fundamental features: identity and intensity. Identity
specifies which lower-level parameters are targeted or which behaviour is upweighted and which is
inhibited. In turn, intensity represents the strength of the signal, such as the degree to which the
lower-level parameters are displaced from their default value. Secondly, the specification process is
responsible for the decision of whether to pursue a controlled process and, if so, the actual selection
of the most appropriate control signal, which specifies which of the possible action plans should be
engaged and how intensely (e.g., accurately) they should be pursued. Thirdly, the monitoring process
ensures that the cognitive system has the requisite information for signal specification, which includes
information about current circumstances and whether the current behaviour is affording progress
towards goals. An extensive literature suggests that monitoring processes could consider various sources
of information, such as response conflict, response delays, errors, Bayesian surprise, and negative
feedback, which might indicate a need for cognitive control (Badre and Wagner, 2004; Botvinick, 2007;
Botvinick et al., 2001; Koechlin et al., 2003; Koechlin and Summerfield, 2007; Laming, 1968; Rabbitt,
1966; Shenhav et al., 2013). At the neural level, EVC theory posits that the dorsal anterior cingulate
cortex (dACC) is involved in monitoring and specification processes and the lateral prefrontal cortex
(lPFC) in regulation processes. Various other theories assign the dACC a role in performance
monitoring, by computing prediction error signals that result from the comparison between expected
and actual action outcomes (Alexander and Brown, 2011; Silvetti et al., 2011; Vassena et al., 2020).
Here, we advance a novel formulation of cognitive control that is conceptually related to previous
proposals of attention to action (Norman and Shallice, 1986), expected value of control (Shenhav et al.,
2013) and performance monitoring (Alexander and Brown, 2011), but which casts the underlying cost-
benefit optimization in terms of Bayes optimal (active) inference and free energy minimization. We
leverage a previous active inference account of cognitive effort as “the qualitative experience of
committing to a behaviour that diverges from a priori habit” (Parr et al., 2023) and extend it in a
multiagent hierarchical — or meta-cognitive control — setting. In this novel formulation, a
hierarchically higher control (i.e., meta-cognitive) level optimizes the parameters of a lower (i.e.,
behavioural) level, therefore providing an optimal solution to the key problem of cognitive control:
ensuring accurate action selection at the lowest computational cost (Botvinick et al., 2019; Doya, 2002;
Kool et al., 2010; Pezzulo et al., 2015, 2018a; Silvetti et al., 2018).
Besides its normative appeal, our proposal reconciles two separate streams of research that focus on
reward-related (Shenhav et al., 2013) and epistemic aspects of cognitive control, such as environmental
uncertainty and ambiguity (Behrens et al., 2007) and Bayesian surprise (Vassena et al., 2020). As we
will discuss, the expected free energy used in active inference for action selection considers both
pragmatic (goal or reward achievement) and epistemic (uncertainty minimization) imperatives, hence
explaining the two facets of cognitive control. Finally, casting free energy minimization in terms of
gradient descent allows us to simulate neuronal dynamics at several levels. Here, we focus on neuronal
responses in the dorsal anterior cingulate cortex (dACC), the locus coeruleus, and the dopaminergic
system, showing noteworthy correspondences with known neurophysiological aspects of cognitive
control.
In what follows, we provide a brief overview of active inference. We subsequently describe three
simulations of a driving task, which exemplify the functioning of an active inference agent without
cognitive control (Simulation 1), with a simple form of cognitive control that only considers the
specification of a control signal (Simulation 2), and a more sophisticated form of (meta-)cognitive
control that additionally covers the decisions of engaging or not engaging deliberation and cognitive
control (Simulation 3). Finally, we discuss how our proposal explains the rich phenomenology and
neurobiology of cognitive control.
Active inference

Active inference is a normative framework that describes cognitive and brain functions in terms of the
overarching imperative of variational free energy minimization (Friston, 2010; Parr et al., 2022; Pezzulo
et al., 2024). The foundational premise is that any organism is endowed with a (generative) model of
the statistical regularities of its environment and uses this world model to infer both the causes of its
sensations (perception) and the best course of action to achieve preferred outcomes (action planning).
Both perception and planning result from the minimization of a functional — variational free energy
— that bounds the organism's sensory surprise or, from a statistical perspective, the evidence (a.k.a.,
marginal likelihood) for its model of the world (Friston, 2010). At the computational level, free energy
minimization corresponds to a process of approximate (variational) Bayesian inference, whereas at the
neuronal level, it can be associated with the dynamics of neuronal populations that encode predictions
and prediction errors.
When simulating active inference, variational free energy (F) can be computed for each allowable
policy or action sequence (π) and comprises two terms:
$$\underbrace{F(\pi)}_{\textit{Variational FE}} = \underbrace{D_{KL}\big[\,Q(s|\pi)\,\|\,P(s|\pi)\,\big]}_{\textit{Complexity}} \;-\; \underbrace{\mathbb{E}_{Q(s|\pi)}\big[\ln P(o|s)\big]}_{\textit{Accuracy}} \qquad (1)$$
In Equation 1, the first quantity on the right-hand side is a complexity term that scores a (Kullback
Leibler or KL) divergence between the posterior beliefs about states of an auxiliary distribution (called
a variational density Q(s|π)), and the prior beliefs about (hidden or latent) states of the world (called a
prior density P(s|π)). The second term is an accuracy, which scores the expected (logarithm of) the
probability of observations given beliefs about unobservable states (ln P(o|s)). The auxiliary distribution
Q in active inference corresponds to the brain’s internal probabilistic beliefs about hidden states given
sensory evidence. This distribution is not arbitrary but emerges from Bayesian updating, minimizing
free energy to achieve a balance between accuracy (aligning beliefs with sensory data) and complexity
(maintaining prior expectations). These two terms jointly ensure that the agent engages in a continuous
action-perception loop, by updating its (posterior) beliefs about the states to better fit its observations and by
selecting courses of actions that realise predictions under those posterior beliefs. This means perception
and action fulfil the same (free energy minimization) imperative.
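To make Equation 1 concrete, here is a minimal Python sketch; all distributions and numbers are illustrative assumptions, not values from the simulations reported below.

```python
import numpy as np

def variational_free_energy(q_s, p_s, A, o):
    """F = KL[Q(s) || P(s)] - E_Q[ln P(o|s)], cf. Equation 1."""
    eps = 1e-16                                   # numerical floor for log(0)
    complexity = np.sum(q_s * (np.log(q_s + eps) - np.log(p_s + eps)))
    accuracy = np.sum(q_s * np.log(A[o, :] + eps))
    return complexity - accuracy

# Illustrative two-state example: context is "safety" or "danger"
q_s = np.array([0.9, 0.1])                        # posterior belief: probably safe
p_s = np.array([0.5, 0.5])                        # flat prior belief
A = np.array([[0.95, 0.05],                       # P(free right lane | context)
              [0.05, 0.95]])                      # P(stones on the right | context)
print(variational_free_energy(q_s, p_s, A, o=0))  # low F: beliefs fit the data
```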
Active inference also treats planning — i.e. the selection of policies or action sequences (π) — as a
form of inference (i.e., planning as inference). However, planning requires including an additional
process of expected free energy minimization, which considers not only present and past information
(as in variational free energy minimization), but also future observations, which the agent can predict
by using its generative model to perform ‘what-if’ simulations. Planning therefore corresponds to
generating possible futures (one for each policy π) and then scoring each policy in terms of expected
free energy and selecting the policy that is expected to minimize free energy in the future.
This expected free energy (G) associated with each policy π considers the prior preferences of the agent
(i.e., extrinsic or pragmatic value) and the expected information gain about states of the world (intrinsic
or epistemic value). These two terms can be rearranged into risk and ambiguity. Risk is the (KL)
divergence between anticipated outcomes given a policy (Q(o|π)) and preferred outcomes (P(o)).
Ambiguity is the expected uncertainty (i.e., conditional entropy H) about outcomes, given the model's
likelihood P(o|s).
$$\underbrace{G(\pi)}_{\textit{Expected FE}} = \underbrace{D_{KL}\big[\,Q(o|\pi)\,\|\,P(o)\,\big]}_{\textit{Risk}} \;+\; \underbrace{\mathbb{E}_{Q(s|\pi)}\big[\mathrm{H}[P(o|s)]\big]}_{\textit{Ambiguity}} \qquad (2)$$
These two components of expected free energy ensure that the plans adaptively balance exploitation
(i.e., preference-seeking) and exploration (i.e., information-seeking).
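A matching sketch for Equation 2, again with purely illustrative numbers: under a strong preference for the OK outcome, a policy whose predicted states yield OK receives a lower expected free energy.

```python
import numpy as np

def expected_free_energy(q_s, A, log_C):
    """G = KL[Q(o) || P(o)] + E_Q(s)[H[P(o|s)]], cf. Equation 2."""
    eps = 1e-16
    q_o = A @ q_s                                      # predicted outcomes Q(o|pi)
    risk = np.sum(q_o * (np.log(q_o + eps) - log_C))   # divergence from preferences
    H_A = -np.sum(A * np.log(A + eps), axis=0)         # outcome entropy per state
    ambiguity = q_s @ H_A                              # expected uncertainty
    return risk + ambiguity

# Illustrative example: outcomes (OK, KO); states predicted under each policy
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])                             # P(outcome | state)
log_C = np.log(np.array([0.99, 0.01]))                 # strong preference for OK
for name, q_s in [("drive right", np.array([0.95, 0.05])),
                  ("drive left", np.array([0.05, 0.95]))]:
    print(name, expected_free_energy(q_s, A, log_C))
```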
In summary, in active inference, the action-perception loop and planning processes are realized via the
minimization of variational free energy and expected free energy, respectively (Parr et al., 2022). These
computations are general, in the sense that they apply to any active inference agent. However, each
active inference agent can be equipped with a (task-specific) generative model and can thus evince
different behaviours. Below we introduce three active inference agents that face a driving task, one
using a generative model without cognitive control (Simulation 1), one using a simple model of meta-
cognitive control (Simulation 2), and one using a full model of meta-cognitive control (Simulation 3).
Simulation 1: active inference without cognitive control
We simulate a driver whose goal is to travel safely from home to her office. The driving scenario
comprises 32 trials; see Figure 1A. At each trial, the agent receives a sensory cue from the environment
and selects one of two possible policies: she can either drive on the right lane along the road or drive
on the left lane. During our simulations, there is initially no danger and so the driver can safely follow
the policy drive on the right lane, staying in the usual (right) lane; this policy becomes increasingly habitual.
However, at some point the driver detects a danger (e.g., rockfall and stones on the right lane) and hence
— to avoid a collision — the driver needs to select the alternative policy drive on the left lane to reach
the unusual (left) lane. This task allows us to illustrate, in simple terms, the cognitive flexibility required
in classical cognitive control tasks such as the Stroop, the Posner and the Eriksen task (MacLeod, 1991;
Nee et al., 2007), and switching tasks (Kiesel et al., 2010; Monsell, 2003; Rubinstein et al., 2001).
The driving scenario showcases two circumstances in which the trade-off between habitual and
deliberative components of action arises: the passage from deliberative to habitual control of action in
stable and safe situations (habit formation) and the opposite passage from habitual to deliberative
control, in novel and dangerous situations (habit override). As we will see, the standard active inference
approach gracefully handles the first case, but not, or not always, the second case.
Figure 1: The driving scenario and the generative model for cognitive control. (A) Schematic
illustration of the driving scenario, with 32 trials. (B) Formal specification of the generative model of
the active inference agent, using the formalism of Partially Observable Markov Decision Processes
(POMDP). The nodes represent the agent's beliefs about states and task variables, which are encoded
as discrete probability distributions. The edges represent the statistical relations between these
variables, which are specified in the A, B, C, D and E matrices. (C) Schematic illustration of the 2
hidden state factors and the 3 observation modalities that define the driving scenario (and that are
faithfully reflected in the agent’s generative model). The hidden state factors comprise contexts that are
not controllable by the agent, “safety” and “danger”, which correspond to the absence or presence of a danger (e.g., stones) on the road, and a controllable factor, so called because the agent can decide which
one to reach at each trial: “on the right lane” is the state that the agent reaches whenever it drives on
the right lane, and “on the left lane” is the state that the agent reaches whenever it drives on the left
lane. Finally, the controllable hidden states comprise 1 ancillary (“start”) state, which represents the
starting state of the agent for each trial, before it makes its decision. The observations comprise 3 visual
observation modalities: the first modality represents either “free right lane” or “stones on the right lane”, which signal that the agent is in the “safety” or “danger” context, respectively; the second modality represents the options “start”, “on the left lane” and “on the right lane”, which signal that the agent is in the “start”, “on the left lane” or “on the right lane” state, respectively. While observations
are distinct from hidden states, here — for simplicity — we assume a one-to-one mapping between them
(i.e., we assume that the A matrices are identity mappings), and represent them in a similar fashion. Of course,
this leaves open the possibility for making the two distinct (e.g., the observation corresponding to the
‘danger’ hidden state could be the sight of stones on the road ahead). Finally, the third observation
modality comprises 2 action outcomes, “OK” (positive outcome) and “KO” (negative outcome), which
depend on the specific combination of context and controllable states. The agent observes OK in two
cases: when the context is “safety” and the agent is “on the right lane” and when the context is
“danger” and the agent is “on the left lane”. Conversely, the agent observes KO in the two alternative
cases: when the context is “safety” and the agent is “on the left lane” and when the context is “danger”
and the agent is “on the right lane”. This reflects the fact that when there is no danger, it is better to
drive on the right lane, whereas in the presence of danger, it is better to drive on the left lane. See the
main text for details.
The generative model to solve the driving task is shown in Figure 1B. It uses the formalism of Partially
Observable Markov Decision Processes (POMDP), with nodes S denoting hidden states (i.e., beliefs or
discrete probability distributions over unobservable task variables that the agent does not see directly
but infers, based on observations, such as the agent's position and the presence of dangers); nodes O
denote observations (observable stimuli, from which the agent can infer hidden states); the node π
denotes beliefs about the policies (or action sequences); and edges represent probabilistic relations
among state variables (the letters A, B, C, D, E within the squares that mark the edges specify the
probabilistic mappings among variables).
For simplicity, we assume that the agent's generative model (shown in Figure 1B) faithfully represents
the “true” variables of the driving task and the statistical relations between them. The A matrix
represents the (likelihood) mapping between states and outcomes. The B matrix (transition priors)
encodes the probability of moving from one state to another. The C matrix encodes prior beliefs about
observations, which in active inference reflect prior preferences. The D vector encodes the prior about
the initial hidden state. The E vector encodes priors about policies. The G denotes expected free energy.
Finally, γ (and its prior value 𝛃𝟎 ) is a precision parameter associated with expected free energy, which
plays an important role in this setting, as it represents the cognitive control signal (and dopaminergic
activity). Note that the variational free energy F is implicit — as it underwrites state estimation during
inference — in contrast to expected free energy G that underwrites action selection explicitly. See Table
1 for a more detailed description of the model variables.
Table 1. Variables of the generative model and their mathematical formulation.

π0, prior beliefs about policies: π0 = σ(γG + ln E)
G, expected free energy, used to score policies during planning: G(π) = −ln P(π | C)
A, likelihood matrix that encodes the statistical relations between hidden states and outcomes: P(o_t | s_t)
B, transition matrix that encodes the statistical relations between subsequent states: P(s_t+1 | s_t, π)
C, matrix encoding prior preferences over observations (observations with higher probabilities are more desired, e.g., the OK outcome): P(o_t | C)
D, vector encoding the prior belief about initial hidden states: P(s_0 | D)
E and e, vector encoding the prior belief about policies before planning (e is the Dirichlet parameter used for learning): P(π)
β and γ, initial value of precision and the precision parameter that determines the relative contributions of habitual (E) and deliberative (G) components of control: β = 1/γ
As shown in Figure 1, the generative model comprises 2 hidden state factors (S) and 3 observation
modalities (O). The hidden state factors comprise contexts that are not controllable by the agent, safety
and danger, which correspond to the absence or presence of a danger (e.g., stones) on the road, and a controllable factor, so called because the agent can decide which state to transition to at each trial: “on the
right lane” is the state that the agent reaches whenever it drives on the right lane, and “on the left lane”
is the state that the agent reaches whenever it drives on the left lane. Finally, the controllable hidden
states comprise 1 ancillary (start) state, which represents the starting state of the agent for each trial,
before it makes its decision. The observations comprise 3 visual observation modalities: the first
modality represents either free right lane or stones on the right lane, which signal that the agent is in the safety or danger context, respectively; the second modality represents the options start, on the left lane and on the right lane, which signal that the agent is in the start, on the left lane or on the right lane state,
respectively. Finally, the third observation modality comprises 2 action outcomes, OK (positive
outcome) and KO (negative outcome), which depend on the specific combination of context and
controllable states. The agent observes OK in two cases: when the context is safety and the agent is on
the right lane and when the context is danger and the agent is on the left lane. Conversely, the agent observes
KO in the two alternative cases: when the context is safety and the agent is on the left lane and when
the context is danger and the agent is on the right lane. This reflects the fact that when there is no
danger, it is better to drive in the usual (right) lane, whereas in the presence of danger (e.g., stones) on
the right lane, it is better to reach the unusual (left) lane. More formally, this means that the agent has a
prior preference for OK over KO observations in the C matrix.
Finally, for simplicity, the generative model only comprises two policies 𝜋, drive on the right lane and
drive on the left lane, both comprising a single action (as opposed to a sequence of actions, as is common in active inference). For this reason, we will use the terms policy and action interchangeably.
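To ground this description, the following sketch shows how the driving-task generative model could be specified as numpy arrays. The paper does not report its exact matrices, so the specific values below (the identity likelihoods, the preference strength, the initial Dirichlet counts) are assumptions consistent with the text rather than the authors' implementation.

```python
import numpy as np

# Hidden state factors: context (safety, danger) and
# position (start, on the right lane, on the left lane)
n_context, n_position = 2, 3

# A: likelihood mappings; the text assumes one-to-one (identity) mappings
A_context = np.eye(n_context)      # free right lane / stones on the right lane
A_position = np.eye(n_position)    # start / on the right lane / on the left lane

# Outcome modality, A_outcome[outcome, context, position]:
# OK for (safety, right) and (danger, left), KO for the two other combinations
A_outcome = np.zeros((2, n_context, n_position))
A_outcome[:, :, 0] = 0.5                        # uninformative in the start state
A_outcome[0, 0, 1] = A_outcome[0, 1, 2] = 1.0   # OK
A_outcome[1, 0, 2] = A_outcome[1, 1, 1] = 1.0   # KO

# B: position transitions, one map per policy (context is uncontrollable)
B_right = np.zeros((n_position, n_position)); B_right[1, :] = 1.0  # drive right
B_left = np.zeros((n_position, n_position)); B_left[2, :] = 1.0    # drive left

# C: preference for OK over KO; D: priors over initial states
C_outcome = np.log(np.array([0.99, 0.01]))
D_context = np.array([0.5, 0.5])
D_position = np.array([1.0, 0.0, 0.0])

# E and e: habit prior over the two policies, accumulated as Dirichlet counts
e = np.array([1.0, 1.0])                        # start unbiased; counts grow
E = e / e.sum()
```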
Each trial is subdivided into two timesteps. At the first timestep, the driver starts from the start state
and from a context (danger or safety) that she needs to infer, based on the observations (stones
on the right lane or free right lane) she receives. At this first timestep, the driver selects one of its two
policies (drive on the right lane and drive on the left lane) and makes a transition to one of the two
controllable states (on the right lane or on the left lane). At the second timestep, the driver receives
sensory observations (on the right lane or on the left lane; OK or KO), which depend on her current
context and controllable state. Then, a new trial begins. Policy selection is described below. One can
think of this as a process of repeatedly deciding, at regular intervals, to stay (or switch) on the right or
deciding to switch (or stay) on the left.
At each trial, the decision between the two policies drive on the right lane and drive on the left lane
depends on the deliberative components of action (G), the habitual components of action (E), and the
balance between these two components determined by γ.
The deliberative component of action (G) corresponds to the online scoring of the quality of the two
policies, drive on the right lane and drive on the left lane, via their expected free energy. As illustrated
in Equation 2, the expected free energy (G) considers how well the two policies realize preferred
outcomes (encoded as prior probabilities over outcomes, C) and how much they resolve uncertainty
about hidden states.
In contrast, the habitual component of action (E) corresponds to a prior over policies. This prior is
learned over time, by accumulating the statistics of policy occurrences, via the underlying Dirichlet
parameters e. In other words, the agent sees herself behaving in a particular way and accrues habitual
priors as time proceeds. Crucially, at the beginning of our simulations, the prior E biases the agent
towards the most frequent policy to drive on the right lane.
The γ parameter is the precision estimate of beliefs about expected free energy (G). It represents the
confidence placed in the deliberative (G) component of action selection and determines its weighting
during action selection: the higher the precision γ of G, the greater the weight of deliberative control
over habitual control (E).
The relative contributions of these components are transformed into a (prior) probability distribution over policies through a normalized exponential (softmax) function, as illustrated in Equation 3:

$$\pi_0 = \sigma(\gamma\,\mathbf{G} + \ln \mathbf{E}) \qquad (3)$$
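A minimal sketch of this combination, with invented numbers. Note that, although Equation 3 is written with +γG, standard implementations give policies with lower expected free energy higher probability, so G enters the softmax with a negative sign in this sketch.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)                 # numerical stability
    return np.exp(x) / np.sum(np.exp(x))

# Illustrative values for the two policies [drive right, drive left]
G = np.array([4.0, 1.0])              # lower expected free energy for "drive left"
E = np.array([0.9, 0.1])              # strong habit for "drive right"

# G enters with a negative sign so that low-G policies become more probable
for gamma in (0.2, 2.0):
    pi_0 = softmax(np.log(E) - gamma * G)
    print(gamma, pi_0.round(3))       # low gamma: habit wins; high gamma: G wins
```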
Evidence about the current observations is then gathered by scoring the variational free energy F = F(π). Because the free energy is conditioned upon policies, this effectively scores the evidence that the agent is pursuing one policy or another. This evidence is then included in the equation to complete a posterior over the policies:

$$\pi = \sigma(\mathbf{F} + \gamma\,\mathbf{G} + \ln \mathbf{E}) \qquad (4)$$
Then, the precision parameter γ is optimized to minimize free energy. The requisite belief updates can
be expressed in terms of a prediction error; namely, the difference between the prior and the posterior
expectations of G.
$$\beta_0 = 1/\gamma$$
$$\beta_{updated} = \beta - \beta_0 + \underbrace{(\boldsymbol{\pi}_0 - \boldsymbol{\pi}) \cdot \mathbf{G}_\pi}_{G\ \textit{error}}$$
$$\gamma = 1/\beta_{updated} \qquad (5)$$
The updated γ term is then used to calculate the posterior over the policies π (Equation 4). From this distribution, the
most probable policy is selected and executed (drive on the right lane or drive on the left lane) and the
agent moves to one of the two controllable states (on the right lane or on the left lane) and observes the
action outcome (OK or KO). Then, a new prior over policies 𝜋0, posterior over policies 𝜋 and γ terms
are computed, given the new observations, then a new trial commences. Importantly, the β𝑢𝑝𝑑𝑎𝑡𝑒𝑑 term
calculated in Equation 5 at any given trial t is used as the new β0 at the next trial t+1.
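As a sketch of how this optimization could proceed, the following code iterates Equation 5 as a gradient step on β = 1/γ, in the style of standard active inference implementations; here F is treated as a cost (lower means more evidence), so it enters the posterior with a negative sign, and all numbers, including the step size, are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    return np.exp(x) / np.sum(np.exp(x))

def optimise_gamma(beta_0, G, log_E, F, n_iter=16, step=0.25):
    """Gradient-style iteration of the precision update of Equation 5."""
    beta = beta_0
    for _ in range(n_iter):
        gamma = 1.0 / beta
        pi_0 = softmax(log_E - gamma * G)       # prior over policies (Eq. 3)
        pi = softmax(log_E - gamma * G - F)     # posterior over policies (Eq. 4)
        g_error = (pi_0 - pi) @ G               # the 'G error' of Equation 5
        beta = beta - step * (beta - beta_0 + g_error)
        beta = max(beta, 0.05)                  # keep precision positive
    return 1.0 / beta

G = np.array([3.0, 1.0])
log_E = np.log(np.array([0.8, 0.2]))
# Evidence favours the low-G policy (an 'OK' trial): gamma increases
print(optimise_gamma(1.0, G, log_E, F=np.array([2.0, 0.5])))
# Evidence favours the high-G policy (a 'KO' trial): gamma decreases
print(optimise_gamma(1.0, G, log_E, F=np.array([0.5, 2.0])))
```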
The precision updates here are slightly non-standard when compared to other active inference schemes,
where they would only be used for multistep policies that we can become progressively confident about
as we pursue them. However, they can be justified simply by interpreting this precision as a confidence
in our ability to choose a good policy, not in our specific choice of policy. In this way, this confidence
can carry over multiple trials for which—possibly very different—inferences may be drawn about the
same choice of actions.
As shown in Equation 5, the optimization of γ depends on the difference between the prior and the
posterior over policies, which in turn depends on the value of F, therefore reflecting the agent’s current
belief about the state of the world. In general, the precision γ reflects the agent’s confidence that the
selected policy reaches the preferred (OK) outcome, therefore it increases when the observation is OK
and decreases when the observation is KO. A consequence of this updating is that after a bad outcome
(KO), the precision γ decreases and then over time, the habitual component of action gains more
prominence compared to the deliberative component, as illustrated in the simulation below.
Simulation 1 results
Here, we simulate the driving scenario under the generative model described above. The results of the
simulation are illustrated in Figure 2. During the first 18 trials, the driver selects the policy to drive on
the right lane at each trial and achieves the preferred OK observation (Figure 2A). During these
trials, the probabilities assigned to drive on the right lane by both the deliberative controller G and the
habitual controller E are very high (see the dark colours, denoting high probabilities in Figure 2F) and
there is no conflict between them (Figure 2D). Over time, the habitual component E becomes
increasingly strong and automatic, i.e., habitisation occurs (Figure 2F). It is worth noting that the pattern of
habit formation appears quite sharp, given that (for illustrative purposes) we used strong priors and high
learning rates. However, it is possible to consider that, in a realistic scenario, each trial could correspond
to (for example) half an hour of driving. This would render habit formation a slow process, as observed
empirically (MacLeod and Dunbar, 1988).
At trial 19, the driver detects a danger: the presence of stones on the right lane. By seeing the stones on
the right lane observation, the agent correctly infers that the context changed from safety to danger.
Consequently, the deliberative controller G assigns a greater probability to the policy to drive on the
left lane. However, the habitual controller E still assigns the greatest probability to the policy to drive
on the right lane as it is the policy that has been executed more frequently. Since the contribution of the
habit is stronger (p_γG(π) < p_E(π)), the selected policy is to drive on the right lane and the agent
receives the KO observation. As explained below, this in turn lowers the precision γ (Figure 2C),
creating a vicious circle that makes behaviour even more habitual. This exemplifies a situation in which
goal-directed behaviour cannot override strong habits, even if a correct task response has been
identified.
Figure 2B shows the Bayesian surprise registered by the agent during the driving task. Bayesian surprise
scores the change in probabilistic beliefs about states, before and after observing outcomes. Formally,
it is defined as a Kullback-Leibler (KL) divergence between probability distributions over hidden states, across two consecutive timesteps within the same trial:

$$\text{Bayesian surprise} = D_{KL}\big[\,Q(s_{t+1})\,\|\,Q(s_{t})\,\big] \qquad (6)$$
A high Bayesian surprise typically indicates an unexpected observation and implies that the agent’s beliefs have to be updated accordingly, most likely requiring a different course of action (Behrens et al., 2007; Kumar et al., 2023; Sales et al., 2019). Bayesian surprise has been
associated with the activity of neuromodulatory systems that control plasticity, such as the
catecholaminergic locus coeruleus (LC) (Hennig et al., 2021; Jordan and Keller, 2023) and to high
arousal at the physiological level (Shine, 2023). In our simulation, the driver experiences a high
Bayesian surprise after trial 18, when the context changes from safety to danger.
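A minimal sketch of this computation, with illustrative beliefs mimicking the context switch at trial 19:

```python
import numpy as np

def kl_divergence(p, q):
    """KL[p || q] between discrete distributions, cf. Equation 6."""
    eps = 1e-16
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

# Beliefs over the context (safety, danger) before and after seeing stones
before = np.array([0.95, 0.05])
after = np.array([0.10, 0.90])
print(kl_divergence(after, before))   # large surprise at the context switch
print(kl_divergence(before, before))  # (near) zero when nothing changes
```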
Figure 2. Simulation of the driving scenario using standard active inference. The simulation illustrates
several key variables of cognitive control. These include (B) Bayesian surprise that scores the difference
in beliefs about states across time: a peak can be seen when the context changes at trial 19; (C) the
initial value of expected free energy precision 1/β0, which increases when the observed outcome aligns with the outcome predicted by the deliberative control; (D) cognitive conflict, which increases at trial 19
as the policy under deliberative control diverges from the policy under habitual control; (E) simulated
dopaminergic dynamics in the mesolimbic pathway, with positive (or negative) spikes associated with
an increase (or decrease) of the alignment between prior expectations and observations; (F) matrices
encoding the probabilities assigned by the deliberative controller p_G(π), the habitual controller p_E(π) and the combined controller weighted by precision, where darker shades indicate higher probabilities.
Notice that at the first trial, the two policies of the habitual controller p_E(π) have the same probability
of 0.5. However, the probability of the policy to “drive on the right lane” increases rapidly during the
subsequent trials. This habituation process is due to the accumulation of statistics of policy
occurrences, via the Dirichlet parameters e. See the main text for explanation.
Figure 2C shows the precision, which increases with time, as the agent becomes more confident in
achieving the desired (OK) observation, and decreases when the current observation does not align
with the expected observations (predicted OK outcome but KO is observed).
Figure 2D shows the degree of cognitive conflict of the active inference agent during the driving task.
Here, the cognitive conflict corresponds to the KL divergence between beliefs about the policy to pursue under deliberative control and the policy to pursue under habitual control:

$$\text{Cognitive conflict} = D_{KL}\big[\,p_{G}(\pi)\,\|\,p_{E}(\pi)\,\big] \qquad (7)$$
In this formulation, when the deliberative and the habitual controllers prioritize the same policy, there
is no cognitive conflict, whereas when they prioritize different policies, conflict can be high. In our
simulation, we observe a high cognitive conflict after trial 18, where the deliberative policy prioritizes
drive on the left lane, whereas the habitual controller prioritizes drive on the right lane. Note that there
is a close relation between the two notions of cognitive conflict and cognitive cost. Intuitively, the habit can be seen as a “default policy” or an initial bias about how to act. This means that the cognitive conflict term reflects how much the deliberative model of an agent diverges from its initial bias, and can be read as a complexity cost (Rubin et al., 2012; Todorov, 2009; Zénon et al., 2019). Conversely, pursuing a deliberative
policy that is the same as the habitual policy has no associated conflict, reflecting the assumption that
decision-makers are intrinsically biased towards low-effort options (Botvinick et al., 2009; Jimura et
al., 2010; Kool et al., 2010; Kool & Botvinick, 2014).
Figure 2E shows simulated dopaminergic activity in the mesolimbic pathway that originates in the VTA
and projects to the limbic system, particularly the nucleus accumbens, the amygdala, and the
hippocampus, and which is associated with the processing of rewarding stimuli and the experience of
pleasure. In this account, simulated dopaminergic activity is associated with positive (or negative)
updates of the precision γ after each observation, which indexes to what extent each observation
increases (or decreases) the agent’s confidence in the policy it is pursuing (Friston et al., 2014; Langdon
et al., 2018; Schwartenbeck, FitzGerald, Mathys, Dolan, & Friston, 2015).
In keeping with neural implementations of active inference (Friston et al., 2017), the spikes are
simulated by considering the rate of change of the precision at each iteration of updating (here,
considering 16 iterations) at the second timestep of each trial.
$$\Delta\delta = 8 \cdot \frac{d\gamma}{dt} + \frac{\gamma}{8} \qquad (8)$$

Here, Δδ represents the change in the dopaminergic signal that is modulated by the precision update. Specifically, δ denotes the dopaminergic signal, and dγ/dt is the derivative of the precision γ with respect to the update iterations. The phasic dopamine responses hence reflect the rate of change of policy precision, which depends on the accordance between desired and anticipated
outcomes. In our simulation, the negative spikes after trial 18 reflect the fact that after observing the
negative KO outcome, the value of the precision γ parameter decreases, and the agent loses confidence
in the course of action to drive on the left lane.
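A sketch of Equation 8 applied to an assumed precision trajectory; the γ values are invented for illustration, and only the 16-iteration structure follows the text.

```python
import numpy as np

# Equation 8 over an illustrative gamma trajectory: 16 update iterations on
# a 'KO' trial (precision falls) followed by 16 on an 'OK' trial (it rises)
gamma = np.concatenate([np.linspace(1.0, 0.6, 16),
                        np.linspace(0.6, 1.2, 16)])
delta = 8 * np.gradient(gamma) + gamma / 8   # phasic + tonic components
print(delta.round(3))                        # negative dip, then positive burst
```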
Figure 2F shows the probabilities assigned by the deliberative controller p_G(π), the habitual controller p_E(π) and the combined controller to the two policies of drive on the right lane (first rows) and drive
on the left lane (second rows) weighted by precision, during the task. Darker colours indicate higher
probabilities. The figure shows that the preference of the deliberative controller shifts from drive on the
right lane to drive on the left lane when the context changes from safety to danger at trial 19; the
preference of the habitual controller remains fixed on drive on the right lane, as the agent develops a habit across trials.
In this simulation, we assume that the habit is stronger (p_γG(π) < p_E(π)) and hence the combined controller assigns a higher probability to the policy drive on the right lane, producing a maladaptive choice. Note that this simulation exemplifies the case of a strong habit; the result would have been different if the habit were weaker, i.e., if p_γG(π) > p_E(π).
Summary of Simulation 1
In sum, this simulation shows an active inference agent who correctly reinforces habits (to drive on the
right lane), but cannot override entrenched habits when necessary. In this condition, called default
override (Botvinick et al., 2001, 2004), the deliberative component (G) correctly recognizes that the
task requires overriding a task-inappropriate response (Silton et al., 2010) but fails to do so because the
habitual response is too strong and “wins” the competition. In other words, the agent gets stuck in a
habitual mode. This is common in several situations, such as in motor control tasks. For example, for a
pilot, driving a car is an overlearned procedural task that uses a strong habitual control. When a
perturbation is introduced in the control task (for example, the steering direction of the wheel is inverted,
such that when turning right, the car steers to the left), there is a strong overriding of habitual control
that can lead to wrong responses even if the new tasks contingencies are acknowledged (Izawa et al.,
2008; Wei and Körding, 2009). However, in most everyday contexts, individuals can override habits,
even when these become strong, by engaging cognitive control (Cavanagh et al., 2013; De Martino et
al., 2006; Paus et al., 1993; Shenhav et al., 2013). In contrast, the model used in Simulation 1 would
only be capable of overriding weak habits.
A key reason why the model used in Simulation 1 gets stuck in habits is that—following poor
performance—the precision 𝛾 parameter decreases, reducing controlled processing. Computationally,
this emerges during the optimization of 𝛾, as described in Equation 5. Because we set a strong prior
over the habit E, free energy minimization entails a reduction in deliberative control. This mechanism
could account for cases in which feedback from the environment is insufficient to devalue the habit.
Empirical examples of these ‘vicious cycle’ dynamics—where negative outcomes reinforce habitual
behavioural patterns—can be found in domains such as emotional regulation and self-control. In
humans, an automatic or habitual anxiety response to stressful situations (e.g., negative outcomes) can
hinder deliberation, creating a self-reinforcing cycle, as seen in anxiety disorders. A similar pattern
occurs in addiction, where failed attempts to resist the addictive behaviour often lead to relapse,
reinforcing the habit further (Heatherton and Wagner, 2011; Hofmann et al., 2012; Smith et al., 2020).
These internal conflicts, vicious cycles, and lapses in self-control are well explained through a Bayesian
lens, as they might arise as a result of bounded optimality (Hayden, 2018). However, in several instances
of cognitive control, both empirical studies (Gratton et al., 1992; Laming, 1968) and theoretical
accounts (Botvinick et al., 2001; Shenhav et al., 2013), demonstrate that errors typically increase rather
than decrease cognitive control. In the next simulation, we extend the model used here to include a
meta-cognitive control mechanism that resolves the above issue, and engages cognitive control when
necessary – by modulating the precision γ parameter adaptively.
Simulation 2: simple meta-cognitive control

Here, we augment the generative model of active inference shown in Figure 1B with a circuit for a (simple) meta-cognitive control level; see Figure 3. This new component is called meta-cognitive control since it regulates one of the parameters of the behavioural-level controller discussed in Simulation 1, namely the precision γ, by setting its prior value β0, which balances habitual and deliberative components of action selection.
In the meta-cognitive control component, the parameter γ’ (having prior expectation 1/β0’) is akin to the parameter γ (having prior expectation 1/β0): it constitutes the precision estimate of beliefs about
expected free energy of the goal-directed controller G. This novel expected precision
parameter γ’ plays the role of a control signal (Shenhav et al., 2013) and of attentional resources
(Cooper and Shallice, 2000; Shallice and Burgess, 1993) and its main role is to prioritize deliberative
components of action selection, when useful.
As shown in Simulation 1, when there is cognitive conflict, a deliberative policy is only engaged if the values of γ or G itself are large enough to overcome E, i.e., if p_γG(π) > p_E(π). This means that it would be possible to overcome a strong habit by increasing the precision γ of the deliberative policy G.
However, as shown in Simulation 1, the precision γ tends to decrease rather than increase when there
is an adverse observation (KO). This is because the update of γ considers the difference between the
prior 𝜋0 and posterior 𝜋, which depends on current observations (via F). This mechanism prioritizes
(fast) habits in the presence of potential threats (LeDoux and Daw, 2018), but is inflexible.
Figure 3. Generative model for active inference with simple meta-cognitive control, used in Simulation
2. The generative model comprises two levels. The lower level (which is the same as Figure 1) is
responsible for behavioural control and comprises habitual (E) and deliberative (G) components of
action selection, which are balanced by the parameter γ. The higher level is responsible for one aspect
of meta-cognitive control (this is why we call it “simple”): it specifies a control signal (γ’) that sets the
prior value of the parameter γ of the lower-level behavioural control for the next trial. The figure
therefore highlights the crucial difference between deliberation (corresponding to G updates) and
cognitive control (corresponding to γ’ updates). See the main text for details.
Our proposed cognitive control model overcomes this limitation. In our proposal, cognitive control lies
at a high (or meta) level of the brain’s control hierarchy (Pezzulo et al., 2018b) and one of its key
functions is to specify a control signal that prioritizes deliberative control, when useful. Specifically,
cognitive control increases the precision γ by engaging prospection to consider fictive future
observations (i.e., OK outcomes) that one might gather by pursuing the deliberative policy G. There is
substantial evidence that imagining future events can render behaviour more deliberative, attenuating
the reward discounting observed in delay discounting experiments (Peters and Büchel, 2010) and
reducing impulsivity (Daniel et al., 2013). In keeping with these findings, we assume that simulating future,
positive outcomes under the deliberative policy G produces an optimism bias that increases the reliance
on the deliberative policy. Of course, in our system, this optimism bias only emerges when the
deliberative policy predicts that the preferred OK outcome will be observed.
The specification of a control signal corresponds to setting a new precision term, called γ’, which sets
the prior β0 for the usual precision term γ, which in turn governs policy selection at the next trial,
according to Equation 4. Crucially, when the expected outcome under the deliberative policy is the
preferred one (the OK outcome), the prior value of γ will be increased by the γ’ updates, hence
prioritizing the deliberative policy.
$$\pi'_0 = \sigma(\gamma'\,\mathbf{G})$$
$$\pi' = \sigma(\mathbf{F}' + \gamma'\,\mathbf{G})$$
$$\beta'_0 = 1/\gamma'$$
$$\beta_0 = 1/\gamma' \qquad (9)$$
The update equations for the γ’ parameter are similar to those of Equation 5, but with two main
differences. First, the prior and the posterior over the policy do not contain the value of the habitual
control E. Second, F’ is not the usual variational free energy F (which is calculated on the basis of the
actual observations) but a free energy calculated on the basis of fictive observations gathered by
engaging prospection. For example, in our scenario, at the end of every trial in which there is a danger,
our driver can gather fictive OK observations by engaging the G component, as if it were selecting drive
on the left lane. This form of prospection consists of eliciting fictive observations by simulating the
current trial (transitioning from the start state to either on the left lane or on the right lane) without
incorporating the habit term E in the policy computation. The cognitive control signal is then obtained
by performing four rounds of optimization of γ’ based on these simulated outcomes. This has the effect
of increasing the value of γ’ to a level that renders choice deliberative.
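A sketch of this prospective specification of the control signal, under the same gradient-style conventions as the earlier precision sketch; the 'four rounds of optimisation' follow the text, while the G and F′ values and the step size are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    return np.exp(x) / np.sum(np.exp(x))

def specify_control_signal(G, F_fictive, beta_prime_0=1.0, n_iter=4, step=0.25):
    """Prospective gamma' updates: as Equation 5, but without the habit E
    and with the free energy F' of fictive (simulated) observations."""
    beta = beta_prime_0
    for _ in range(n_iter):
        gamma_p = 1.0 / beta
        pi_0 = softmax(-gamma_p * G)             # pi'_0: no habit term
        pi = softmax(-gamma_p * G - F_fictive)   # pi': fictive evidence
        g_error = (pi_0 - pi) @ G
        beta = max(beta - step * (beta - beta_prime_0 + g_error), 0.05)
    return 1.0 / beta                            # used as 1/beta_0 next trial

# Fictive OK outcomes strongly support the deliberative 'drive left' policy
G = np.array([3.0, 1.0])
F_fictive = np.array([4.0, 0.2])
print(specify_control_signal(G, F_fictive))      # gamma' > 1 boosts deliberation
```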
Note that while the standard method to set the precision in active inference (Equation 5) is retrospective
(i.e., it depends upon E and F), the novel proposed method (Equation 9) is prospective and depends on
future (fictive) observations expected when pursuing the deliberative policy (i.e., it depends upon G
and F’). This prospective form of γ optimization is related to the (optimistically) biased belief that an
active inference agent will engage in policies that are free energy minimizing. To prompt action in the
future, evidence about these present occurrences (E and F) must be ignored at the meta-cognitive
control level (for the relation between agency and precision see Friston, Samothrakis, et al., 2012;
Friston et al., 2013).
In summary, the cognitive control model proposed here uses a series of fictive observations, sampled
from the driver’s beliefs at the end of one trial, to imagine what the precision would have been—in the
absence of habitual beliefs—had these fictive samples been attained. The resulting precision is then
used as a prior for the precision during the next trial (i.e., using the reciprocal of the expected precision
under fictive observations as the beta parameter for the prior over precision for the next trial). The
rationale for this is that, if we can be confident (i.e., have high precision) in a policy that does not
include habits, then it makes sense to weight the contribution of this non-habitual policy by increasing
our precision going forwards. Both prospective and retrospective methods are used in the simulations
but the former method is used to set the prior value 𝛃𝟎 of the latter.
Results of Simulation 2
Here, we simulate the driving scenario using active inference with (simple) meta-cognitive control. The
results of the simulation are illustrated in Figure 4.
As in Simulation 1, during the first 18 trials, there is no danger and the driver selects the policy to
drive on the right lane at each trial (Figure 4A), since this policy is supported by both deliberative G
and habitual controllers E. At trial 19, the agent observes stones on the right lane, experiences a high
Bayesian surprise (Figure 4B) and correctly infers that the context changed from safety to danger. At
this point, as in Simulation 1, the habitual controller suggests drive on the right lane, but the
deliberative controller suggests drive on the left lane, creating a cognitive conflict (Figure 4D).
This conflict enables cognitive control: fictive observations of the OK outcome drive the positive
updating of γ’ (Figure 4F), which in turn acts as the initial value of γ (1/β0) (Figure 4C). At trial 22, this
value becomes large enough to overcome the habit E and then the agent engages the deliberative plan
(Figure 4G), completing the task successfully.
Figure 4. Results of Simulation 2. Simulation of the driving scenario using active inference with simple meta-cognitive control. The simulation illustrates several key variables of cognitive control. These include (B) Bayesian surprise, which scores the difference in beliefs across time: a peak can be noticed when the context changes at trial 19; (C) the initial value of expected free energy precision 1/β0, which in a situation of cognitive conflict increases thanks to cognitive control signals. Notice how now, compared to the previous simulation, β0 is determined by γ’; (D) cognitive conflict, which increases at trial 19 as the policy under deliberative control diverges from the policy under habitual control. After the policy “drive on the left lane” is selected at trial 22, conflict diminishes as the habit for “drive on the right lane” gets devalued; (E) dopaminergic spikes in the mesolimbic pathway, with positive (or negative) spikes when observations align (or misalign) with prior expectations; (F) dopaminergic spikes in the mesocortical pathway, which are generated by updating γ’ through fictive observations and are used to set the initial value of γ; (G) matrices encoding the probabilities assigned by the deliberative controller p_G(π), the habitual controller p_E(π) and the combined controller weighted by precision, where darker colours indicate higher probabilities.
Figures 4E and 4F show the updating of the two precision parameters (γ and γ’), which we associate with
dopaminergic activity in two pathways, mesolimbic and mesocortical, respectively. The updates of the
precision parameter γ might be linked to mesolimbic dopaminergic activity, which has been implicated in
incentive salience and the (expected) certainty that desired outcomes will be achieved (Berridge, 2012;
FitzGerald et al., 2015; Schwartenbeck et al., 2015a). In our simulation, the negative spikes around trial
18 reflect the decrease of certainty of the deliberative plan when the unpredicted KO outcome is
observed, whereas the positive spikes afterwards reflect an increase of certainty of the deliberative plan,
when the observed outcome is OK, in line with the deliberative component (Figure 4E).
The updates of the precision parameter γ’ might be linked to dopaminergic responses in the mesocortical
pathway. This pathway begins in the VTA and sends dopamine projections to the prefrontal cortex and
plays a crucial role in cognitive control (Brozoski et al., 1979; Cools et al., 2019; Sawaguchi and
Goldman-Rakic, 1991). In our simulation, dopaminergic activity in the mesocortical pathway is elicited
by the meta-cognitive control level and boosts deliberative control (Figure 4F). This perspective is in
keeping with the idea that dopaminergic activity plays a motivational role during cognitive control,
acting as a motivational modulator (Cools, 2016), producing an optimism bias for action (Sharot et al.,
2012), and influencing the willingness to engage effort (Aarts et al., 2008; Botvinick & Braver, 2015;
Padmala & Pessoa, 2011; Westbrook & Braver, 2016).
Summary of Simulation 2
In summary, Simulation 2 shows that engaging cognitive control — to increase the precision γ — is an
effective strategy to overcome strong habits, when cognitive conflicts arise. However, this simulation
only captures one aspect of cognitive control: namely, the specification of a control signal to prioritize
deliberative control (this is why we call it simple meta-cognitive control). This is done by allowing the
higher (meta-cognitive) level generative model to regulate the precision parameter γ of the lower
(behavioural) level generative model. The simulation makes two simplifying assumptions: first,
cognitive control is activated automatically when cognitive conflict is detected, rather than emerging
from the minimization of free energy under the appropriate generative model; second, engaging
cognitive control comes at no cost and the agent can increase precision indefinitely. Therefore, the
model from Simulation 2 is useful for explaining how cognitive control is engaged but not when or to
what extent. We next introduce a more comprehensive meta-cognitive control model — a full meta-
cognitive control model — that overcomes these limitations and offers a more expressive and plausible
account of cognitive control.
Simulation 3: full meta-cognitive control

In this simulation, we implement meta-cognitive control as a free energy minimization process, using
a hierarchical generative model that comprises both a behavioural-level model and a meta-cognitive-
level model that monitors and regulates the parameters of the behavioural-level model (Figure 5).
Conceptually, this differs from typical hierarchical models in active inference. The formulation here
can be thought of as representing a brain in terms of two (or more) interacting ‘agents’. One of these
(the behavioural-level model) is the same as the agent of Simulation 1. The other (the meta-cognitive-
level model) is an agent who can observe the belief-updates of the first, using these as its data. It cannot
take action to change the outside world directly but can intervene on the prior beliefs of the first agent.
We unpack these as if they were two levels of a hierarchical model below—but it is important to note
that we could not invert this graphical model using standard Bayesian message passing schemes in the
manner of the models considered above.
The behavioural-level generative model is the same as in the previous simulations (i.e., the POMDP of
Figure 1), except for the fact that it comprises an additional state factor and associated observations (the
presence of a warning signal or its absence, no signal). This state factor carries information about the probability of observing stones on the right lane; see below.
Figure 5. Generative models for active inference for meta-cognitive control, used for Simulation 3.
There are two generative models here, one for lower-level behavioural control and one for meta-
cognitive control. These are implemented as separate (but reciprocally connected) POMDPs. The
behavioural control model is the same as in Figure 1, except for the fact that it includes one additional
state factor, which comprises the states warning signal and no signal and their associated observations.
The meta-cognitive control generative model is a separate POMDP model, composed of its own A, B,
C, D, and E matrices and policies and whose observations are derived from the results of inference at
the level below. The meta-cognitive-level model operates once per trial, at every second timestep of the behavioural-control model, and interacts reciprocally with it. See the main text
for details.
The meta-cognitive-level generative model is a separate POMDP model, composed of its own A, B, C,
D, and E matrices and policies. The meta-cognitive-level model makes decisions according to the
current beliefs of the behavioural-level model, by operating once per trial, at every second timestep of the behavioural-level model. To consider the behavioural-level beliefs, the
meta-cognitive-level model treats some aspects of the behavioural-level computation as a generative
process with hidden states to be inferred. Essentially, the meta-cognitive level performs active inference in which the generative process comprises aspects of the behavioural-level generative model.
More specifically, the aspects of the behavioural-level model that constitute a generative process at the
meta-cognitive-level are the computations of Bayesian surprise (Equation 6) and cognitive conflict
(Equation 7). These two KL divergences are continuous values that are discretised into two discrete
hidden states each. This is done by implementing the following mapping:
We make the simplifying assumption that the KL divergences lie within a bounded range:

$$KL \in [KL_{min}, KL_{max}]$$

We divide this range into two intervals of equal length, which will represent the discrete states:

$$\text{interval\_length} = \frac{KL_{max} - KL_{min}}{2} \tag{11}$$

Each interval is centred on a midpoint:

$$\mu_i = KL_{min} + \left(i - \tfrac{1}{2}\right)\,\text{interval\_length}, \quad i = 1, 2 \tag{12}$$

Then we use a normal distribution centred on the midpoint of each interval to model the probability of each discrete state, given the KL divergence:

$$P(s_i \mid KL) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(KL - \mu_i)^2}{2\sigma^2}\right) \tag{13}$$

The standard deviation σ is controlled by prc, a precision parameter that determines how sharply the mapping concentrates the probability around the appropriate discrete state:

$$\sigma = \frac{\text{interval\_length}}{prc} \tag{14}$$

Setting a high value of prc reduces σ and hence increases the precision of the mapping, concentrating the probabilities on the interval closest to the KL value. In this setting, we set prc = 5.
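To make this mapping concrete, here is a minimal sketch in Python (the bounds KL_min = 0 and KL_max = 3 are our own assumptions, as the numerical range is not specified here; the normalisation constant of Equation 13 cancels once the state probabilities are normalised):

```python
import numpy as np

def discretise_kl(kl, kl_min=0.0, kl_max=3.0, prc=5.0):
    """Map a continuous KL divergence onto two discrete states (low/high),
    following Equations 11-14; kl_min and kl_max are assumed bounds."""
    interval_length = (kl_max - kl_min) / 2                 # Equation 11
    sigma = interval_length / prc                           # Equation 14
    mus = kl_min + (np.arange(2) + 0.5) * interval_length   # Equation 12: midpoints
    p = np.exp(-(kl - mus) ** 2 / (2 * sigma ** 2))         # Equation 13 (unnormalised)
    return p / p.sum()                                      # state probabilities

print(discretise_kl(0.4))   # almost all mass on the "low" state
print(discretise_kl(2.6))   # almost all mass on the "high" state
```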
As a result, at the meta-cognitive level, we obtain two non-controllable state factors that depend on the
KL divergences at the behavioural level. One state factor contains the states: high cognitive conflict and
low cognitive conflict. The other state factor contains the states: high surprise and low surprise. The
corresponding observation modalities comprise the observations high cognitive conflict and low cognitive conflict, and high surprise and low surprise, respectively. In this way, the behavioural-level model
influences the meta-cognitive-level model by generating data whose causes must be inferred.
The meta-cognitive-level model also comprises two state factors that are controllable. The states included are: deliberation engaged or deliberation not engaged; and cognitive control engaged or cognitive control not engaged. The associated outcome modalities include the observations: deliberation engaged or deliberation not engaged; and cognitive control engaged or cognitive control not engaged. Furthermore, the meta-cognitive-level model can select between three policies (all having length 1): action 1, engaging both deliberation and cognitive control; action 2, engaging deliberation but not cognitive control; or action 3, engaging neither deliberation nor cognitive control. It also comprises an
additional outcome modality that reports the selected policy, with an identity A matrix that maps the
three policies to three observations called own action is 1, own action is 2, own action is 3.
The three policies at the meta-cognitive-level are selected by considering their expected free energy,
which we call G’ to distinguish it from the expected free energy G of the lower-level policies. The prior
preferences of the meta-cognitive-level model are encoded in its C matrix. These comprise a negative
preference for the two observations: deliberation engaged and cognitive control engaged. These
negative preferences reflect the assumption that decision-makers are biased towards low-effort options
— and the fact that engaging deliberation and cognitive control have associated cognitive costs
(Botvinick et al., 2009; Jimura et al., 2010; Kool et al., 2010; Kool and Botvinick, 2014). Furthermore,
the C matrix of the meta-cognitive-level comprises a positive preference for the OK-meta observation
and a negative preference for the KO-meta observation. The OK-meta and KO-meta observations play
a similar role to the OK and KO observations in the behavioural-level model, as they correspond to
positive and negative preferences, but they are not generated by external sensations; rather, they are
generated by internal monitoring processes that consider whether the context (or more precisely, the
meta-cognitive-level belief about the context) affords or does not afford cognitive control (see the
Discussion).
OK-meta is observed in three cases: first, when the meta-cognitive-level beliefs about the context are
high cognitive conflict and high surprise and those about the controllable states are cognitive control
engaged and deliberation engaged; second, when the meta-cognitive-level beliefs about the context are
low cognitive conflict and high surprise and those about the controllable states are cognitive control not
engaged and deliberation engaged; and third, when the meta-cognitive-level beliefs about the context
are low cognitive conflict and low surprise and those about the controllable states are cognitive control
not engaged and deliberation not engaged. KO-meta is observed in the other cases. The main function
of the OK-meta and KO-meta observations is prioritizing deliberation and cognitive control under the
appropriate circumstances. For this, during policy selection, the meta-cognitive-level considers the
trade-off between the preferred (OK-meta) observation versus the cognitive costs (or negative
preferences) associated with the two observations deliberation engaged and cognitive control engaged.
Numerically, the value of the preferred (OK-meta) observation is set to be greater than the cost of the
two aversive observations (deliberation engaged and cognitive control engaged), to allow the meta-
cognitive-level to resolve the trade-off effectively.
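As a rough illustration (not the paper's implementation; the state labels are ours), the following sketch returns OK-meta for exactly the three belief configurations listed above and KO-meta otherwise:

```python
# Belief configurations (conflict, surprise, control, deliberation) that
# generate OK-meta, as listed in the text; all other cases generate KO-meta.
OK_META_CASES = {
    ('high conflict', 'high surprise', 'control engaged',     'deliberation engaged'),
    ('low conflict',  'high surprise', 'control not engaged', 'deliberation engaged'),
    ('low conflict',  'low surprise',  'control not engaged', 'deliberation not engaged'),
}

def meta_outcome(conflict, surprise, control, deliberation):
    """Internal monitoring rule mapping meta-level beliefs to OK-meta / KO-meta."""
    case = (conflict, surprise, control, deliberation)
    return 'OK-meta' if case in OK_META_CASES else 'KO-meta'

print(meta_outcome('high conflict', 'high surprise',
                   'control engaged', 'deliberation engaged'))      # OK-meta
print(meta_outcome('high conflict', 'high surprise',
                   'control not engaged', 'deliberation engaged'))  # KO-meta
```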
The meta-cognitive-level model influences the behavioural-level model by engaging in a Bayesian
model selection process, which selects between three possible behavioural-level models, having
different complexities. In the first model M1, the prior over the policy has the form π0 = σ(ln E + γG), while the prior over the precision has the form P(γ|γ′) = Γ(1, 1/γ′). This means that this model includes
both the computation of G and the γ’ updates and hence it includes both deliberation and cognitive
control. In the second model M2, the prior over the policy has the form π0 = 𝜎(ln𝐄 + 𝛾𝐆), while the
prior over the precision has form 𝑃 (𝛾) = Γ(1, 𝛽0 ). This means that this model includes the computation
of G but not that of γ’ updates and hence it engages deliberation but not cognitive control. Finally, in
the third model M3, the policy prior has the form π0 = σ(ln E), and the prior over the precision has the form P(γ) = Γ(1, β0). Therefore, this model includes neither the computation of G nor the γ’ updates and hence engages neither deliberation nor cognitive control.
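To illustrate how the three model variants weigh habit against deliberation, here is a minimal sketch with made-up numbers for a two-policy problem; following the text, we write the prior as σ(ln E + γG), so larger values of G favour a policy under this sign convention:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

lnE = np.log(np.array([0.9, 0.1]))    # habitual prior: strong habit for policy 1
G = np.array([-3.0, -1.0])            # deliberation favours policy 2 (larger G)
gamma = 2.0                           # policy precision (boosted by gamma' in M1)

pi0_M1_M2 = softmax(lnE + gamma * G)  # M1/M2: habit combined with deliberation
pi0_M3 = softmax(lnE)                 # M3: habit only (no G, no gamma' updates)

print(pi0_M1_M2)  # deliberation overturns the habit, favouring policy 2
print(pi0_M3)     # the habit alone favours policy 1
```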
The Bayesian model selection considers two elements. The former is a prior over the three behavioural-
level models, which we set to be flat for simplicity:
$$P(\text{Models}_{\text{Behavioural level}}) = \left[\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right] \tag{15}$$
The latter is a likelihood 𝑃(𝑂𝑢𝑡𝑐𝑜𝑚𝑒𝑠𝑂𝑤𝑛 𝐴𝑐𝑡𝑖𝑜𝑛 | 𝑀𝑜𝑑𝑒𝑙𝑠𝐵𝑒ℎ𝑎𝑣𝑖𝑜𝑢𝑟𝑎𝑙 𝑙𝑒𝑣𝑒𝑙 ) , which maps from
(observations about) the actions selected by the meta-cognitive-level model to the three possible
behavioural level models. For simplicity, here we assume an identity matrix, in which the probabilities
of M1, M2 and M3 are set to 1 when the (observations about the) actions selected by the meta-cognitive-
level controller are own action is 1, own action is 2, and own action is 3, respectively.
A posterior distribution over the three behavioural-level models is obtained using standard Bayesian
inference, by combining the prior and the likelihood; the behavioural-level model with the highest probability is then selected. Hence, effectively, the meta-cognitive-level decisions about engaging or not
engaging deliberation and cognitive control are translated into the selection of different behavioural-
level models that do or do not engage deliberation and cognitive control.
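A minimal sketch of this selection step, under the flat prior of Equation 15 and the identity likelihood assumed above (indices 0, 1, 2 stand for M1, M2, M3):

```python
import numpy as np

prior = np.full(3, 1 / 3)        # Equation 15: flat prior over M1, M2, M3
likelihood = np.eye(3)           # P(own action is i | M_j): identity mapping

def select_behavioural_model(own_action):
    """own_action: index of the observed meta-level action (0, 1 or 2)."""
    posterior = likelihood[own_action] * prior
    posterior /= posterior.sum()
    return int(np.argmax(posterior))

# Observing "own action is 2" selects M2 (deliberation without cognitive control).
print(select_behavioural_model(1))  # -> 1
```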
To sum up, in this scheme the behavioural and meta-cognitive levels influence each other reciprocally,
as if they were two ‘agents’ in a multiagent setting. The meta-cognitive-level continuously infers and
monitors the presence of surprise and cognitive conflict at the behavioural level. At the same time, the
meta-cognitive-level selects the model to be used at the behavioural level, by selecting between models
that do or do not compute G and the γ’ updates.
Results of Simulation 3
At the beginning of the task, the context is safety and the driver adopts the policy to drive on the right
lane, which rapidly becomes a habit (Figures 6A and 6H). At this point, the deliberative component is
not engaged, to save cognitive resources (Figures 6G and 6H).
However, at trial 12 the driver encounters a surprising warning signal (Figure 6A), which is a cue for
the fact that the context could change from safety to danger. The warning signal elicits the selection of a policy to move to the deliberation engaged state at the meta-cognitive control level, which in turn engages the deliberative controller G at the behavioural level (Figure 6H), with a slight increase in mental effort and dACC activity (Figure 6G). At trial 19, the driver observes stones on the right lane. At this point, the driver not only experiences high Bayesian surprise (Figure 6B) but also experiences a
high cognitive conflict (Figure 6D), because the habitual and the deliberative components suggest two
different policies (Figure 6H). The simultaneous presence of Bayesian surprise and cognitive conflict
elicits the selection of the policy to engage both deliberation and cognitive control at the meta-
cognitive-level. This, in turn, elicits a control signal (Figure 6F) that prioritizes the deliberative
component of action selection (Figure 6C), determining a correct change of lane to avoid the danger
(Figures 6A and 6H).
Figure 6. Results of Simulation 3. Simulation of the driving scenario in active inference with full meta-
cognitive control. The simulation illustrates several key variables of meta-cognitive control. These
include (B) Bayesian surprise that scores the difference in beliefs across time: a peak can be noticed at
trial 12 when the “warning signal” is observed. This triggers the meta-cognitive-level to move to the
state “deliberation engaged” and compute expected free energy at the behavioural level. Furthermore, another peak of Bayesian surprise can be observed when the context changes at trial 19; (C) the initial value of the expected free energy precision 1/β0, which in a situation of cognitive conflict increases thanks to cognitive control signals; (D) cognitive conflict, which increases at trial 19 as the policy under deliberative control diverges from the policy under habitual control. After the policy “drive on the left lane” is selected at trial 22, conflict diminishes as the habit for “drive on the right lane” is devalued; (E) dopaminergic spikes in the mesolimbic pathway, showing a positive (or negative) spike when there is an increase (or decrease) in confidence, as observations align (or misalign) with prior expectations; (F) dopaminergic spikes in the mesocortical pathway, generated by updating γ’ through fictive observations and used to set the initial value of γ. In contrast to the previous simulation, here the updates are not deployed indefinitely; (G) mental effort and simulated dACC activity, which scores the divergence of the prioritized deliberative model γG from the habit E, plus the cognitive costs of diverging from the preferred observations “deliberation not engaged” and “cognitive control not engaged”. It increases when the state “deliberation engaged” is selected at trial 12 and increases further at trial 19, when the state “cognitive control engaged” is selected and γG diverges from the habit E; (H) matrices encoding the
probabilities assigned by the deliberative controller pG(π), the habitual controller pE(π), and the combined controller weighted by precision, where darker colours indicate higher probabilities.
Figures 6E and 6F show simulated dopaminergic activity in the mesolimbic and mesocortical pathways,
respectively, during the task. Mesocortical dopaminergic activity reflects the increase of the control
signal γ’ over time, when the policy to engage cognitive control is selected at the meta-cognitive-level
(Figure 6F). Mesolimbic dopaminergic activity follows a more complex profile: at trial 19 — when the
driver sees stones on the right lane — mesolimbic dopaminergic activity shows negative spikes,
reflecting the fact that the driver has lost confidence (as encoded in the precision of policies γ) that her
course of action will be adequate to achieve desired (OK) observations. However, when cognitive
control is engaged, the prior precision increases (counteracting these negative peaks) until the driver’s confidence is high enough to change the observed outcome from KO to OK at trial 22. These new outcomes then create positive peaks of mesolimbic dopaminergic activity that boost precision further (Figure 6E).
Figure 6G plots the dynamics of mental effort during the task, which is necessary to exert cognitive
control. In our simulation, the driver engages (a small amount of) mental effort when she observes the
traffic sign at trial 12, in order to engage the deliberative controller. Later on, starting from trial 19, the
driver engages (a larger amount of) mental effort, because she needs to deploy a control signal γ’ in
order to prioritize the deliberative policy. In contrast to the previous simulation, here these updates are
not deployed indefinitely, but are influenced by the presence of conflict.
The decision about whether or not to engage cognitive control is made by the meta-cognitive-level, by
balancing its benefit (in terms of an increased probability of achieving the desired outcome OK-meta) and
its cognitive costs. The overall mental effort is quantified as follows:
$$\text{Mental effort} = D_{KL}\!\left[\, \underbrace{p_G(\pi \mid \gamma)}_{\substack{\text{Prioritization of}\\\text{deliberate control}}} \;\middle\|\; \underbrace{p_E(\pi)}_{\substack{\text{Habitual}\\\text{prior}}} \,\right] + \underbrace{D_{KL}\!\left[\, Q(o_{3,4} \mid \pi) \;\middle\|\; P(o_{3,4}) \,\right]}_{\text{Cost of control}} \tag{16}$$
The former term reflects the cost of deviating from habitual priors over policies (E) (Rubin et al., 2012; Todorov, 2009; Zénon et al., 2019) and is scored as a KL divergence between the prioritized deliberative model γG and the habit E (see also Parr et al., 2023).
The cost of control reflects the fact that the driver has a prior preference for the two low-cost
observations deliberation not engaged and cognitive control not engaged over the two high-cost
observations deliberation engaged and cognitive control engaged (Figure 5). The cost of control is
scored as the KL divergence between the probability of obtaining these outcomes (written as 𝑜3,4) with
and without engaging cognitive control and is linked to the risk component of the expected free energy
(Equation 2). This cost of control decreases the probability that the agent will engage deliberative control G and update γ’ (or, in other words, renders her less certain that she will do so). Note that while (for simplicity) we keep the costs of control fixed, in realistic settings they could accumulate with time, as an effect of fatigue (Botvinick et al., 2009; Green & Myerson, 2004; Sozou, 1998). Furthermore, we sidestep other possible cognitive costs associated with task requirements, such as the need to engage prospection, to maintain in working memory a representation of the risky context, the number of γ’ updates, and others, which might entail further cognitive and metabolic demands. Finally, cognitive costs associated with the more complex behavioural-level models could also be included in the priors used for the Bayesian model selection of Equation 15.
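For concreteness, the following sketch evaluates Equation 16 with made-up distributions; in the model proper, these distributions are produced by inference at the behavioural and meta-cognitive levels:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Discrete KL divergence D_KL[p || q]."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

p_G = np.array([0.15, 0.85])  # precision-weighted deliberative policy distribution
p_E = np.array([0.90, 0.10])  # habitual prior over policies
q_o = np.array([0.80, 0.20])  # predicted probability of the "engaged" outcomes (o3, o4)
p_o = np.array([0.20, 0.80])  # preferred (low-effort) outcome distribution

mental_effort = kl(p_G, p_E) + kl(q_o, p_o)  # prioritization term + cost of control
print(round(mental_effort, 2))
```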
At the neurophysiological level, we associate mental effort with neuronal dynamics in the dorsal Anterior
Cingulate Cortex (dACC). An influential proposal suggests that the dorsal anterior cingulate cortex
(dACC) integrates information about costs and rewards, to compute the net value associated with the
allocation of control to a given task, to determine whether and how much control should be invested,
and ultimately, to deploy the optimized control signal (Shenhav et al., 2013; see also Badre and Wagner, 2004; Botvinick et al., 2001; Callaway et al., 2022, 2021; Grahek et al., 2020; Musslick et al., 2015). In keeping with this, a large body of evidence indicates that dACC is active in conditions that require
adjustments in control intensity and that it influences brain structures responsible for regulation. For
example, various neuroimaging studies suggest a relation between dACC activity during conflicts and
the subsequent increment of activity in areas associated with attentional regulation (Cavanagh and
Frank, 2014; Kerns, 2006; Kerns et al., 2004; King et al., 2010; MacDonald et al., 2000). Similar
evidence comes from EEG studies, where electrophysiological indicators of dACC responses have been
detected during attentional adjustments after conflict and errors (Aarts et al., 2008; Carter et al., 1998;
Crottaz-Herbette and Menon, 2006; Forster et al., 2011; Shenhav et al., 2013; Sohn et al., 2007).
The simulation presented here could be extended to cover longer periods, in which safety and threat
contexts alternate. Crucially, the capability of the model to flexibly adapt to a novel danger depends on
its state—specifically, its current cognitive conflict and precision levels—when the novel danger
appears. If a novel danger appears during a period in which these levels are still elevated—as a
consequence of previous danger—the model would be able to engage deliberative processes faster.
Conversely, if the novel danger appears when the model has already established a (new or old) habit,
then it would have no advantage. For simplicity, in our simulations we have assumed that habits are quickly re-established. This is why cognitive conflict returns almost to baseline levels shortly after a goal-
directed action is selected (see Figure 6D). However, one might consider that—under the assumption
that an initial danger could be predictive of subsequent dangers—it might be adaptive to have a low
learning rate for novel habits. This would maintain relatively high levels of cognitive control even after
the first danger disappears. Alternatively, one could consider generative models with explicit (expected)
transitions between “safe” versus “dangerous” contexts.
Summary of Simulation 3
This simulation shows how full meta-cognitive control — here, derived from free energy minimization
— allows a driver to conserve cognitive resources (by avoiding engaging the deliberative controller
when unnecessary), rely on habits when no surprises are encountered, and suspend them through cognitive control when behaviour needs to be adapted to a new (risky) context. This model explains
how, when, and to what extent an agent needs to exert cognitive control. Furthermore, this simulation
illustrates the importance of warning signals or cues to pre-empt unexpected and aversive outcomes
(Gabriel and Orona, 1982). In cognitive control, cues provide predictive information about impending
changes in policy, increasing the demand for controlled processing and facilitating behavioural switches
(Kiesel et al., 2010; Monsell, 2003). In our simulations, the cue is unable to directly influence action
selection at the behavioural level, as its influence over decision probabilities is negligible due to the
strength of the habit. Instead, the cue represents a surprising event that influences decisions at the meta-
cognitive level, triggering the engagement of deliberate planning. This can be appreciated by noticing
that in Figure 6H, deliberate processing is absent before the cue is observed at trial 12. Importantly,
even if the cue does not directly trigger a specific behaviour, it renders subsequent deliberative action
selection faster. This can be appreciated by noticing that, if the cue were not present, deliberative
processing would only become active after the stones on the right lane were observed at trial 19,
requiring at least one trial in the context of danger before deliberation could be engaged. Given that the subsequent shift from habitual to deliberative processing — thanks to the engagement of cognitive control — also takes time, the driver would be slower in changing lanes in the absence of a cue.
This example foregrounds the prospective aspect of the cue and the fact that it influences action
selection indirectly, by influencing the meta-cognitive decision to engage (or not engage) deliberation.
At the neurophysiological level, the cost-benefit computation required to decide whether or not (and
how much) to engage cognitive control can be associated with the dACC (Shenhav et al., 2013).
Furthermore, the deployment of a control signal can be associated with mesocortical dopaminergic
activity, which in turn influences mesolimbic dopaminergic activity which, once predicted outcomes
are observed, signals an increased confidence in one’s course of action.
To more systematically test the effectiveness of meta-cognitive control, we compared the performance
of the full meta-cognitive control model (used in Simulation 3) with that of the active inference model
without cognitive control (used in Simulation 1), using different parameterizations. To ensure a fair
comparison, we set the warning signal state—used only by the meta-cognitive control model—to no
signal. We did not include the model used in Simulation 2, since it could be considered a special case
of Simulation 3, in which cognitive control is triggered automatically under the appropriate conditions
and the effort cost is set unrealistically to zero. We performed three sets of 40 simulations, each varying
one of the three parameters that play a crucial role in the cognitive control process:
- The learning rate of the habit, which determines how strongly the prior over the habit increases across trials. The parameter range is η = 1, 2, …, 40.
- The prior on precision, which defines the initial balance between habits and deliberation at trial 1. The parameter range is β = 1.2, 1.18, …, 0.1.
- The precision of the preferences, which influences how strongly motivated the agent is to achieve an "OK" outcome. The parameter range is c = 0.8, 0.84, …, 2.4.
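The structure of these sweeps can be sketched as follows; run_simulation is a hypothetical stand-in for a full run of the driving task (the model inversion itself is not reproduced here), so the scaffold only illustrates how the parameter grids are traversed:

```python
import numpy as np

def run_simulation(model, **params):
    # Hypothetical placeholder: a full run would invert the generative model(s)
    # over all trials and return the percentage of OK outcomes on danger trials.
    return float('nan')

etas  = 1 + np.arange(40)           # habit learning rate: 1, 2, ..., 40
betas = 1.2 - 0.02 * np.arange(40)  # prior on precision: 1.2, 1.18, ...
cs    = 0.8 + 0.04 * np.arange(40)  # precision of preferences: 0.8, 0.84, ...

# One sweep per parameter, for each of the two models under comparison
results = {
    model: [run_simulation(model, eta=float(eta)) for eta in etas]
    for model in ('no_cognitive_control', 'full_meta_cognitive_control')
}
```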
We measured performance by calculating the percentage of OK outcomes in trials where danger was
present, requiring the agent to devalue the habit to avoid it: see Figure 7. Figure 7A shows the results
of each of the three sets of simulations, illustrating that the full meta-cognitive control outperforms the
model without cognitive control. Figure 7B shows the performance of the models using full meta-
cognitive control and without cognitive control—for each of the 40 parameter values varied for each
simulation—with darker colours indicating better performance. The results show that the model with
full meta-cognitive control achieves a good performance in a large region of parameter space, with the
two main exceptions being high levels of habit learning (that make behaviour inflexible) and low levels
of precision of preferences (that make the model insensitive to outcomes). Conversely, the model
without cognitive control is only effective in a smaller region of parameter space. In sum, these
simulations show that the full meta-cognitive control model is more effective and robust compared to
the model without cognitive control.
Figure 7. Performance comparison between the model without cognitive control and with full meta-
cognitive control. Performance is measured by the percentage of "OK" outcomes achieved over 40
simulation iterations across various parameter ranges. The parameters varied were: the learning rate
of the habit, which determines how strongly the prior over the habit is reinforced across trials. The
parameter range is η = 1, 2, …, 40; the prior on precision, which defines the initial balance between
habits and deliberation at trial 1. The parameter range is β = 1.2, 1.18, …, 0.1; the precision of the
preferences, which influences how strongly motivated the agent is to achieve an "OK" outcome. The
parameter range is c = 0.8, 0.84, …, 2.4. The table shows that the full meta-cognitive model consistently
achieves a greater percentage of "OK" outcomes compared to the model without cognitive control. In
the grid plot below, each square represents a simulation for a given parameter value, with darker tints
indicating a higher percentage of "OK" outcomes, and therefore, better performance. See the main text
for explanation.
Discussion
In the cognitive sciences, we often distinguish between habitual and goal-oriented selection of
behaviour. Goal-oriented behaviour can become habitual over time, through repeated performance of
that behaviour. Under certain circumstances, acquired habits may turn out to be maladaptive. Cognitive
control is the name given to the process of monitoring performance and conflicts between habitual and
goal-oriented behaviours and, where necessary, suppressing maladaptive habits to rebalance towards
goal-orientation. We presented a novel theory of cognitive control within the framework of active
inference, which explains how cognitive control enables acting beyond default behaviours, by
optimizing a cognitive control signal (at the meta-cognitive level) that in turn prioritizes deliberative
components of control over habits (at the behavioural level). In turn, the optimization of the cognitive
control signal requires prospection and the simulation of future positive evidence, which renders cognitive control both future-oriented and effortful. For ease of explanation, we first introduced a simple meta-cognitive control model that captures only a subset of the aspects of cognitive control – namely, the
specification of a control signal to prioritize deliberative control – and then a full meta-cognitive control
scheme in which cognitive control stems from free energy minimization and requires effort.
Our simulations illustrate that when performing repetitive tasks, such as driving, it is possible to move
from a more demanding (deliberative) form of control to a less demanding (habitual) form of control,
and then back to deliberative control, when needed. While the former change from deliberative to
habitual forms of control (or habit formation) emerges naturally in previous implementations of active
inference (Friston et al., 2016; Maisto et al., 2019), the latter change from habitual to deliberative forms
of control requires multiagent hierarchical processing: a hierarchical generative model in which the
higher meta-cognitive level can monitor some beliefs (e.g., Bayesian surprise, cognitive conflict) and influence some parameters (e.g., precision) of the lower behavioural level. In turn, this hierarchical
processing exemplifies the deep connections between our proposal and three prominent theories of
cognitive control: attention to action (Norman and Shallice, 1986), expected value of control (Shenhav
et al., 2013) and the performance monitoring framework (Alexander and Brown, 2011). The first theory
assumes that a (higher level) supervisory attentional system deploys cognitive control to bias action
selection; the second theory assumes that cognitive control is based on a cost-benefit evaluation that
balances the payoff — that one can obtain by engaging a controlled process — against its cognitive
costs; and the third theory emphasizes the importance of monitoring prediction error signals that result
from the comparison between expected and actual action outcomes. Our proposal is conceptually
related to the above theories, but casts the computations of cognitive control within the framework of
active inference (Mittenbühler et al., 2024; Parr et al., 2022; Schwöbel et al., 2021). Our simulations
illustrate the interplay of various cognitive variables and mechanisms that have been identified as
crucial for cognitive control, such as surprise and volatility, context monitoring, confidence in one’s
course of actions, the specification of a control signal, regulation, mental effort, cognitive conflict and
the costs of control (Botvinick et al., 2001; Kool et al., 2010; Laming, 1968; Rabbitt, 1966; Shenhav et
al., 2013). For example, during a driving task performed habitually, monitoring Bayesian surprise
(Figure 6B) permits engaging prospective processes associated with mental effort (Figure 6G); in turn,
this eventually permits detecting cognitive conflicts (Figure 6D), which engage control signals (Figure 6C) that prioritize deliberative components of action selection (Figure 6H), thereby avoiding maladaptive effects of habits.
Crucially, while the above cognitive variables and mechanisms are often studied and modelled
independently, they all straightforwardly emerge under the free energy minimization scheme of active
inference and reciprocally influence each other. In other words, in our approach, cognitive control
emerges as a straightforward extension of the action-perception loop of active inference — it is just
another thing that is optimized by minimizing free energy — and does not require heterogeneous
mechanisms for optimization, as in previous proposals. Another advantage of our proposal is that the
expected free energy functional that is minimized is richer than the notion of value or reward considered
in expected value of control (Shenhav et al., 2013), as it also encompasses epistemic factors, such as
the value of resolving environmental uncertainty and ambiguity, which are highlighted in other studies
(Behrens et al., 2007). This is because the expected free energy functional used to select amongst
policies automatically balances pragmatic (utility-maximization) and epistemic (uncertainty-
minimization) imperatives (Friston et al., 2017).
Finally, our conceptualization of full meta-cognitive control (Simulation 3)—a process that emerges
from the interaction of multiple agents engaging in active inference by observing one another—
reconciles the roles of two key processes implicated in cognitive control and dACC functioning; namely,
monitoring Bayesian surprise and cognitive conflict. In our proposal, the monitoring of Bayesian
surprise is always active and it is key for the decision to engage or not engage deliberation. This
continuous monitoring ensures that even when action is under habitual control, with little “attention to
action”, it is possible to detect circumstances (e.g., signs of danger) that make deliberation important.
In contrast, the monitoring of cognitive conflict is only possible when deliberation is engaged. This process engages cognitive control (if there is conflict) or avoids doing so, hence saving cognitive resources (if
there is no conflict). The importance of monitoring both Bayesian surprise and cognitive conflict has
often been recognized in cognitive control and dACC functioning, and our model provides a novel
perspective on how to harmonize the role of these processes within a unifying framework.
Our cognitive control models introduce two technical novelties compared to standard implementations
of active inference; namely, the prospective simulation and sampling of future outcomes over multiple
trials to select priors and (in the meta-cognitive control model) a multiagent modelling of brain function
in which one (meta-cognitive) agent can observe some aspects of another (behavioural) agent —
namely, the computations of Bayesian surprise and cognitive conflict — and intervene upon it to
covertly change its functioning. We argue that modelling cognitive control necessitates introducing
these novelties. The prospective simulation mechanism reflects the assumption that cognitive control
has a future-oriented aspect (Pezzulo, 2012); this assumption is shared by the expected value of control
(EVC) model, which proposes that the agent simulates possible outcomes and chooses the control signal that maximizes them (Grahek et al., 2020; Musslick et al., 2015). In other words, to escape
learned habits, it is often necessary to imagine future events; a process that has been shown to increase
deliberation, reduce impulsivity (Daniel et al., 2013), and attenuate the discount of future rewards
(Peters and Büchel, 2010; Pezzulo and Rigoli, 2011). In turn, engaging in prospection and future
simulation entails a cognitive cost, which makes cognitive control effortful and explains why, under certain circumstances, we succumb to maladaptive habits. The multiagent modelling reflects the
assumption of “meta-cognitive” and “meta-learning” models in cognitive science and deep learning that
the higher-level (or meta) model monitors the functioning of the lower-level model and changes its
parameters over time, to optimize its functioning during cognitive tasks (Boldt et al., 2019; De Martino
et al., 2013; Fleming and Daw, 2017) or its learning capabilities (Botvinick et al., 2019; Silvetti et al.,
2023; Wang et al., 2018). The meta-cognitive decision about when to deliberate is also similar to models
of optimal information sampling (Callaway et al., 2022, 2021). From a generative modelling
perspective, the meta-cognitive and the behavioural generative models could be considered as two
interacting ‘agents’ that inform, and are informed by, each other.
The model presented in this study has several limitations that need to be addressed in future studies.
First, the generative models are designed by hand and include only a few states, observations,
and policies. While our implementation is appropriate for illustrative purposes, future studies should
consider the learning of richer generative models for cognitive control in realistic settings, which is
feasible by using standard statistical learning methods of active inference (Binz et al., 2023) or
approaches from robotics (Da Costa et al., 2022; Lanillos et al., 2021; Pio-Lopez et al., 2016; Taniguchi
et al., 2023). Another limitation regards the way we addressed the trade-offs between the costs and
benefits of engaging deliberation and cognitive control in the meta-cognitive-level model. For
simplicity, we assumed that some internal monitoring process (playing the role of a generative process
for the meta-cognitive-level model) ensures that the preferred (OK-meta) observations are generated,
whenever the appropriate conditions for cognitive control are met; for example, when the meta-
cognitive-level beliefs about the context are high cognitive conflict and high surprise and those about
controllable states are cognitive control engaged and deliberation engaged. However, we did not fully
specify the nature and functioning of the internal monitoring process. Furthermore, we manually set the
numerical value of preferred (OK-meta) observations to be greater than the costs of engaging
deliberation and cognitive control, in order to ensure that cognitive control is effectively deployed when
useful. However, determining the conditions in which deploying cognitive resources and cognitive
control is adaptive is a central concern in theories of meta-cognitive control, bounded rationality and
value of control. For example, some theories consider that both the behavioural and the meta-cognitive
levels maximize the same quantity (e.g., reward) whereas others consider that these are separate
optimization processes (Bénon et al., 2024; Gershman et al., 2015; Keramati et al., 2011; Pezzulo et al.,
2013; Shenhav et al., 2013). Future studies might formalize the internal monitoring process that sets
preferred (OK-meta) observations and their value as an additional mechanism — e.g., a third agent in
our multiagent setting — that learns to recognize the most appropriate conditions in which deploying
cognitive control is adaptive, perhaps after a maturation process during adolescence (Luna et al., 2015).
Alternatively, one might model the meta-cognitive-level as a Markov Decision Process (MDP), not as
a Partially Observable Markov Decision Process (POMDP). In this latter case, there is no need for an
external monitoring process, since the meta-cognitive-level is (so to speak) its own generative process and
the hidden states are interpretable directly as brain states, in contrast to most active inference models in
which brain states are the beliefs about the hidden states. This solution would remove the need for an
external monitoring process, but would imply that the meta-cognitive-level cannot be wrong about (for
example) experiencing surprise or cognitive conflict. Furthermore, we assume that the two models can
communicate and we implement the requisite exchange in a simple and direct manner. To facilitate
information exchange, we use a simplified representation of neural and cognitive dynamics. In future
studies, we hope to model how this exchange takes place—specifically, how the dynamics of Bayesian
surprise and cognitive conflict influence cognitive control parameters. Another limitation of the study
is the short time series and the binary way in which events unfold in our simulated environment—
transitioning from a state of safety to a state of danger. In future studies, we aim to demonstrate the
flexibility of the metacognitive agent in longer, more naturalistic scenarios where safety and threat
evolve dynamically. Another notable limitation of this study is the lack of direct empirical grounding
in well-established benchmarks of cognitive control research. For ease of exposition, we selected the
driving scenario, which helped illustrate cognitive control dynamics in a practical, everyday context.
Future research should strengthen the model’s empirical grounding—and establish its broader
applicability to cognitive control research—by evaluating it against classic behavioral findings such as
the Stroop effect, congruency sequence effects, and proportion congruency manipulations. A critical
comparison would be between models like ours that assume prospective aspects of cognitive control,
such as the prospective simulation of fictive observations, and those that only assume retrospective
aspects. Another critical comparison would be with alternative models that assign to the dACC distinct
roles in (for example) computing surprise or cognitive conflict signals. These and other model
comparisons would help validate the assumptions of the model presented here and suggest useful
extensions. Another limitation concerns our simplified simulation of neural activity. Following standard
applications of active inference, we assumed that neural populations encode expectations of hidden
states as average firing rates (Friston et al., 2017). For example, we simulated neural activity associated
with cognitive effort (and the dACC) by combining components of the behavioural and metacognitive
beliefs and posteriors. This allowed us to simulate dACC dynamics at the individual level, as a function
of belief updating. However, our simplified simulations are unlikely to capture the full range of dACC
activations across cognitive tasks, which have been variously linked to value, surprise and other
variables (see Vriens et al., 2025, for a recent discussion). Testing the plausibility of our simulations of
the dACC (and of other brain circuits) deserves further exploration. Finally, another limitation is the
fact that we only considered binary choices between engaging or not engaging deliberation and
cognitive control. Assuming a simple dichotomy between “automatic” and “controlled” processes
might be too restrictive (Hommel, 2019; Neys, 2023; Pezzulo et al., 2013). A more nuanced perspective
is that there might be a continuum, along which a given process can be more or less automatic relative
to another, often evidenced by the interference one process exerts on another. A compelling
demonstration of this principle is the fact that extensively training participants on a novel task enabled them to reverse the usual interference pattern, thereby illustrating the malleability of automaticity
(MacLeod and Dunbar, 1988). Future studies might consider more graded choices in which (for
example) the most appropriate levels of deliberation and cognitive control are selected.
The free energy minimization dynamics have straightforward correspondences with neuronal responses
in various brain areas (Bastos et al., 2012; Da Costa et al., 2021; Friston, 2005; Friston et al., 2017,
2006; Isomura, 2021; Parr et al., 2022). Below, we discuss the putative neuronal underpinnings of
the proposed model that are key to cognitive control: the dopaminergic (DA) system, the dorsal anterior
cingulate cortex (dACC), and the locus coeruleus (LC).
The dopaminergic system. According to the active inference framework, dopamine plays a central role
in belief updating, as it encodes the meaningful information content of observations (FitzGerald et al.,
2015; Friston et al., 2014; Friston, Shiner, et al., 2012; Schwartenbeck, FitzGerald, et al., 2015) which
under certain conditions reflects reward prediction error (Cohen et al., 2012; D’Ardenne et al., 2008;
Flagel et al., 2011; Schultz, 1998; Schultz et al., 1997). More specifically, dopaminergic activity is
related to precision signals, which reflect the confidence that one is pursuing a course of action that will
lead to one’s preferred observations: in line with the idea that dopamine is key to the pursuit of goals
(Friston et al., 2013; Hesp et al., 2021).
The hierarchical generative model illustrated in this study links the role of dopamine as a signal for goal
pursuit to its role for cognitive effort (Cools et al., 2019). It does this by presenting two precision signals,
namely, the usual precision γ of the policies at the lower (behavioural) level and a novel precision term
γ’ that corresponds to the control signal selected at the higher (meta-cognitive) level of control. The
different optimizations of γ and γ’ allow us to distinguish dopaminergic activity in the mesolimbic and
the mesocortical pathways, respectively. In keeping with previous active inference studies (FitzGerald
et al., 2015; Friston et al., 2014; Friston, Shiner, et al., 2012; Schwartenbeck, FitzGerald, Mathys,
Dolan, Kronbichler, et al., 2015), we argue that the updating of the precision γ after gathering actual
observations (via the free energy F) can be associated with the mesolimbic dopaminergic pathway. This
update indexes how quickly the agent is becoming more (or, in the case of negative updates, less) confident
in her choices (Friston et al., 2014; Langdon et al., 2018; Schwartenbeck, FitzGerald, Mathys, Dolan,
& Friston, 2015). This mechanism reflects dopaminergic activity in the mesolimbic pathway that
originates in the VTA and projects to the limbic system, particularly the nucleus accumbens, amygdala,
and hippocampus. This pathway is associated with the processing of rewarding stimuli, the experience
of pleasure and motivation (Braver et al., 2014): from stimulus-reward Pavlovian associations (Swart et al., 2017) to cognitive cost-benefit analysis and reinforcement learning (Salamone et al., 2016a, 2016b; Schultz et al., 1997; Tobler et al., 2005), putatively promoting learning based on reward
prediction errors (Montague et al., 1996).
In contrast, the updating of the precision γ’ after gathering fictive observations (via the free energy F’)
could be associated with the mesocortical dopamine pathway. This assumption reflects the fact that
dopaminergic activity plays a role in modulating prefrontal areas from the midbrain (Brozoski et al.,
1979; Cools, 2011; Cools et al., 2019; Sawaguchi and Goldman-Rakic, 1991). Dopamine could act as a motivational modulator that influences value-related computations for the selection of adaptive behaviour (Cools, 2016, 2008). It may modulate the value of decisions, promoting cognitive effort to achieve better outcomes (Aarts et al., 2008; Botvinick and Braver, 2015; Iodice et al., 2017; Padmala and Pessoa, 2011; Westbrook et al., 2020; Westbrook and Braver, 2016). This supports the idea of cognitive control as the modulation of optimistic biases for action: fictive observations are simulated to allow dopamine to increase our confidence that we will minimize free energy (Fisher et al., 2025; Sharot et al., 2012).
The dorsal anterior cingulate cortex (dACC). In our proposal, we associated neuronal dynamics in the
dACC with the deployment of mental effort, which combines the specification of control signals with the
estimation of costs and habits (Badre & Wagner, 2004; Botvinick et al., 2001). This proposal reflects
the idea that dACC integrates information about costs and rewards, computes the net value associated
with the allocation of control to a given task, determines how much control should be invested, and
ultimately deploys the optimized control signal (Shenhav et al., 2013). Other related proposals link
dACC to two key processes of cognitive control: the monitoring of conflict, volatility and the
probability of undesired outcomes (Botvinick et al., 2001); and the specification of a control signal
(Shenhav et al., 2013). Furthermore, the dACC could be indirectly involved in a third key process of
cognitive control: it could deploy information about conflicts to other brain areas, such as the lateral
prefrontal cortex (lPFC), which are responsible for regulation. A large body of evidence indicates that
dACC is active in conditions that require adjustments in control intensity and that it influences brain
structures responsible for regulation. For example, various neuroimaging studies suggest a relation
between dACC activity during conflicts and the subsequent increment of activity in areas associated
with attentional regulation (Cavanagh and Frank, 2014; Kerns, 2006; Kerns et al., 2004; King et al.,
2010; MacDonald et al., 2000). Similar evidence comes from EEG studies, where electrophysiological
indicators of dACC responses have been detected during attentional adjustments after conflict and
errors (Aarts et al., 2008; Carter et al., 1998; Crottaz-Herbette and Menon, 2006; Forster et al., 2011;
Shenhav et al., 2013; Sohn et al., 2007). Finally, other proposals emphasize the involvement of the
dACC in Bayesian surprise computations (Alexander and Brown, 2011; Vassena et al., 2020) and meta-
learning (Silvetti et al., 2018). While adjudicating between (or reconciling) these theories of dACC is
beyond the scope of this paper, we note that our model generates quantitative predictions about various
variables potentially associated with dACC activity, such as Bayesian surprise (Figure 6B), cognitive
conflict (Figure 6D), and mental effort (Figure 6G), which could be directly compared with neural
activations in cognitive control tasks.
The locus coeruleus (LC) and noradrenaline. The locus coeruleus (LC), one of the main sources of noradrenaline (NA) in the brain, is implicated in the processing of information about statistical regularities,
such as surprise, uncertainty, and volatility of the environment (Behrens et al., 2007; Sales et al., 2019).
In our proposal, the LC could be responsible for calculating Bayesian surprise: the LC could deploy
information about Bayesian surprise to the dACC, via cortico-LC connections, as this information is
essential to initiate adjustments of the level of control in the dACC. For example, high levels of
Bayesian surprise lead to changes in the expected free energy G, influencing the dynamics of γ. Second, a
high level of Bayesian surprise could promote the release of noradrenaline, thereby increasing the learning rate and fostering faster model updating. A complementary perspective is that noradrenaline promotes
reward-based learning. In this perspective, there would be a hierarchy of neuromodulators:
noradrenaline (which lies higher in the hierarchy) would promote the release of dopamine (which lies
lower in the hierarchy), which in turn would mediate faster learning via reward prediction errors
(Silvetti et al., 2018).
In sum, we provided possible links between the computations of the active inference scheme — and in
particular, those required to optimize a control signal — and neuronal dynamics in the dopaminergic
(DA) system, dorsal anterior cingulate cortex (dACC), and the locus coeruleus (LC). Our proposal
suggests a coherent brain circuit that supports the various facets of cognitive control. For example, a
putative process of “increased attention to action” could start when conditions of high Bayesian
surprise and cognitive conflict are detected (via interactions between LC and dACC). This, in turn,
would increase learning rate (via noradrenaline) and the control signal (in the dACC), which would
change the weight assigned to different behavioural controllers (in the lateral prefrontal hierarchy);
possibly, prioritizing a more deliberative mode of behaviour by involving dopaminergic circuits and
higher hierarchical levels. Active inference provides a principled framework to formalize these
subprocesses parsimoniously (i.e., in terms of the optimization of a single, free energy functional).
However, as emphasized in our discussion, the ways these processes relate to brain activity are debated;
for example, the computation of surprise signals has been linked to both the dACC (Alexander and
Brown, 2011; Vassena et al., 2020) and the LC (Behrens et al., 2007; Sales et al., 2019). Reconciling
these and other alternative proposals is an open objective for future studies.
Acknowledgements
This research received funding from the European Research Council under the Grant Agreement No.
820213 (ThinkAhead), the Italian National Recovery and Resilience Plan (NRRP), M4C2, funded by
the European Union – NextGenerationEU (Project IR0000011, CUP B51E22000150006, “EBRAINS-
Italy”; Project PE0000013, “FAIR”; Project PE0000006, “MNESYS”), and the Ministry of University
and Research, PRIN PNRR P20224FESY and PRIN 20229Z7M8N to G.P.; the Wellcome Centre for
Human Neuroimaging (Ref: 205103/Z/16/Z) to K.F., a Canada–UK Artificial Intelligence Initiative
(Ref: ES/T01279X/1) to K.F. TP is supported by an NIHR Academic Clinical Fellowship (ref: ACF-
2023-13-013). The funders had no role in study design, data collection and analysis, decision to publish,
or preparation of the manuscript.
References
Aarts, E., Roelofs, A., Turennout, M. van, 2008. Anticipatory Activity in Anterior Cingulate Cortex
Can Be Independent of Conflict and Error Likelihood. J. Neurosci. 28, 4671–4678.
https://doi.org/10.1523/JNEUROSCI.4400-07.2008
Alexander, W.H., Brown, J.W., 2011. Medial prefrontal cortex as an action-outcome predictor. Nature
Neuroscience 14, 1338–1344. https://doi.org/10.1038/nn.2921
Anderson, J.R., 1982. Acquisition of cognitive skill. Psychological Review 89, 369–406.
https://doi.org/10.1037/0033-295X.89.4.369
Badre, D., 2008. Cognitive control, hierarchy, and the rostro–caudal organization of the frontal lobes.
Trends in Cognitive Sciences 12, 193–200. https://doi.org/10.1016/j.tics.2008.02.004
Badre, D., Wagner, A.D., 2004. Selection, Integration, and Conflict Monitoring: Assessing the Nature
and Generality of Prefrontal Cognitive Control Mechanisms. Neuron 41, 473–487.
https://doi.org/10.1016/S0896-6273(03)00851-1
Balleine, B.W., Dickinson, A., 1998. Goal-directed instrumental action: contingency and incentive
learning and their cortical substrates. Neuropharmacology 37, 407–419. https://doi.org/10.1016/S0028-
3908(98)00033-1
Bastos, A.M., Usrey, W.M., Adams, R.A., Mangun, G.R., Fries, P., Friston, K.J., 2012. Canonical
microcircuits for predictive coding. Neuron 76, 695–711. https://doi.org/10.1016/j.neuron.2012.10.038
Behrens, T.E.J., Woolrich, M.W., Walton, M.E., Rushworth, M.F.S., 2007. Learning the value of
information in an uncertain world. Nat Neurosci 10, 1214–1221. https://doi.org/10.1038/nn1954
Bénon, J., Lee, D., Hopper, W., Verdeil, M., Pessiglione, M., Vinckier, F., Bouret, S., Rouault, M.,
Lebouc, R., Pezzulo, G., Schreiweis, C., Burguière, E., Daunizeau, J., 2024. The online metacognitive
control of decisions. Commun Psychol 2, 1–17. https://doi.org/10.1038/s44271-024-00071-y
Berridge, K.C., 2012. From prediction error to incentive salience: mesolimbic computation of reward
motivation. Eur J Neurosci 35, 1124–1143. https://doi.org/10.1111/j.1460-9568.2012.07990.x
Binz, M., Dasgupta, I., Jagadish, A.K., Botvinick, M., Wang, J.X., Schulz, E., 2023. Meta-learned
models of cognition. Behavioral and Brain Sciences 1–38.
Boldt, A., Blundell, C., De Martino, B., 2019. Confidence modulates exploration and exploitation in
value-based learning. Neuroscience of consciousness 2019, niz004.
Botvinick, M., Braver, T., 2015. Motivation and cognitive control: from behavior to neural mechanism.
Annu Rev Psychol 66, 83–113. https://doi.org/10.1146/annurev-psych-010814-015044
Botvinick, M., Ritter, S., Wang, J.X., Kurth-Nelson, Z., Blundell, C., Hassabis, D., 2019.
Reinforcement Learning, Fast and Slow. Trends in Cognitive Sciences 23, 408–422.
https://doi.org/10.1016/j.tics.2019.02.006
Botvinick, M.M., 2007. Conflict monitoring and decision making: reconciling two perspectives on
anterior cingulate function. Cognitive, Affective, & Behavioral Neuroscience 7, 356.
Botvinick, M.M., Braver, T.S., Barch, D.M., Carter, C.S., Cohen, J.D., 2001. Conflict monitoring and
cognitive control. Psychol Rev 108, 624–652. https://doi.org/10.1037/0033-295x.108.3.624
Botvinick, M.M., Cohen, J.D., Carter, C.S., 2004. Conflict monitoring and anterior cingulate cortex: an
update. Trends Cogn Sci 8, 539–546. https://doi.org/10.1016/j.tics.2004.10.003
Botvinick, M.M., Huffstetler, S., McGuire, J.T., 2009. Effort discounting in human nucleus accumbens.
Cogn Affect Behav Neurosci 9, 16–27. https://doi.org/10.3758/CABN.9.1.16
Braver, T.S., Krug, M.K., Chiew, K.S., Kool, W., Westbrook, J.A., Clement, N.J., Adcock, R.A., Barch,
D.M., Botvinick, M.M., Carver, C.S., Cools, R., Custers, R., Dickinson, A., Dweck, C.S., Fishbach, A.,
Gollwitzer, P.M., Hess, T.M., Isaacowitz, D.M., Mather, M., Murayama, K., Pessoa, L., Samanez-
Larkin, G.R., Somerville, L.H., MOMCAI group, 2014. Mechanisms of motivation-cognition
interaction: challenges and opportunities. Cogn Affect Behav Neurosci 14, 443–472.
https://doi.org/10.3758/s13415-014-0300-0
Brozoski, T.J., Brown, R.M., Rosvold, H.E., Goldman, P.S., 1979. Cognitive deficit caused by regional
depletion of dopamine in prefrontal cortex of rhesus monkey. Science 205, 929–931.
https://doi.org/10.1126/science.112679
Callaway, F., Rangel, A., Griffiths, T.L., 2021. Fixation patterns in simple choice reflect optimal
information sampling. PLOS Computational Biology 17, e1008863.
https://doi.org/10.1371/journal.pcbi.1008863
Callaway, F., van Opheusden, B., Gul, S., Das, P., Krueger, P.M., Griffiths, T.L., Lieder, F., 2022.
Rational use of cognitive resources in human planning. Nat Hum Behav 6, 1112–1125.
https://doi.org/10.1038/s41562-022-01332-8
Carter, C.S., Braver, T.S., Barch, D.M., Botvinick, M.M., Noll, D., Cohen, J.D., 1998. Anterior
cingulate cortex, error detection, and the online monitoring of performance. Science 280, 747–749.
https://doi.org/10.1126/science.280.5364.747
Cavanagh, J.F., Eisenberg, I., Guitart-Masip, M., Huys, Q., Frank, M.J., 2013. Frontal Theta Overrides
Pavlovian Learning Biases. J. Neurosci. 33, 8541–8548. https://doi.org/10.1523/JNEUROSCI.5754-
12.2013
Cavanagh, J.F., Frank, M.J., 2014. Frontal theta as a mechanism for cognitive control. Trends in
Cognitive Sciences 18, 414–421. https://doi.org/10.1016/j.tics.2014.04.012
Cohen, J.Y., Haesler, S., Vong, L., Lowell, B.B., Uchida, N., 2012. Neuron-type-specific signals for
reward and punishment in the ventral tegmental area. Nature 482, 85–88.
https://doi.org/10.1038/nature10754
Cools, R., 2016. The costs and benefits of brain dopamine for cognitive control. WIREs Cognitive
Science 7, 317–329. https://doi.org/10.1002/wcs.1401
Cools, R., 2011. Dopaminergic control of the striatum for high-level cognition. Current Opinion in
Neurobiology, Behavioural and cognitive neuroscience 21, 402–407.
https://doi.org/10.1016/j.conb.2011.04.002
Cools, R., 2008. Role of Dopamine in the Motivational and Cognitive Control of Behavior.
Neuroscientist 14, 381–395. https://doi.org/10.1177/1073858408317009
Cools, R., Froböse, M., Aarts, E., Hofmans, L., 2019. Dopamine and the motivation of cognitive
control. Handb Clin Neurol 163, 123–143. https://doi.org/10.1016/B978-0-12-804281-6.00007-0
Cooper, R., Shallice, T., 2000. Contention scheduling and the control of routine activities. Cognitive
Neuropsychology 17, 297–338. https://doi.org/10.1080/026432900380427
Crottaz-Herbette, S., Menon, V., 2006. Where and when the anterior cingulate cortex modulates
attentional response: combined fMRI and ERP evidence. J Cogn Neurosci 18, 766–780.
https://doi.org/10.1162/jocn.2006.18.5.766
Da Costa, L., Lanillos, P., Sajid, N., Friston, K., Khan, S., 2022. How Active Inference Could Help
Revolutionise Robotics. Entropy 24, 361. https://doi.org/10.3390/e24030361
Da Costa, L., Parr, T., Sengupta, B., Friston, K., 2021. Neural Dynamics under Active Inference:
Plausibility and Efficiency of Information Processing. Entropy (Basel) 23, 454.
https://doi.org/10.3390/e23040454
Daniel, T.O., Stanton, C.M., Epstein, L.H., 2013. The Future Is Now: Reducing Impulsivity and Energy
Intake Using Episodic Future Thinking. Psychol Sci 24, 2339–2342.
https://doi.org/10.1177/0956797613488780
D’Ardenne, K., McClure, S.M., Nystrom, L.E., Cohen, J.D., 2008. BOLD responses reflecting
dopaminergic signals in the human ventral tegmental area. Science 319, 1264–1267.
https://doi.org/10.1126/science.1150605
Daw, N.D., Gershman, S.J., Seymour, B., Dayan, P., Dolan, R.J., 2011. Model-based influences on
humans’ choices and striatal prediction errors. Neuron 69, 1204–1215.
https://doi.org/10.1016/j.neuron.2011.02.027
Daw, N.D., Niv, Y., Dayan, P., 2005. Uncertainty-based competition between prefrontal and
dorsolateral striatal systems for behavioral control. Nat Neurosci 8, 1704–1711.
https://doi.org/10.1038/nn1560
De Martino, B., Fleming, S.M., Garrett, N., Dolan, R.J., 2013. Confidence in value-based choice. Nat
Neurosci 16, 105–10. https://doi.org/10.1038/nn.3279
De Martino, B., Kumaran, D., Seymour, B., Dolan, R.J., 2006. Frames, Biases, and Rational Decision-
Making in the Human Brain. Science 313, 684–687. https://doi.org/10.1126/science.1128356
Dezfouli, A., Balleine, B.W., 2012. Habits, action sequences and reinforcement learning. European
Journal of Neuroscience 35, 1036–1051. https://doi.org/10.1111/j.1460-9568.2012.08050.x
Dezfouli, A., Lingawi, N.W., Balleine, B.W., 2014. Habits as action sequences: hierarchical action
control and changes in outcome value. Philos Trans R Soc Lond B Biol Sci 369, 20130482.
https://doi.org/10.1098/rstb.2013.0482
Dolan, R.J., Dayan, P., 2013. Goals and habits in the brain. Neuron 80, 312–325.
https://doi.org/10.1016/j.neuron.2013.09.007
Dorfman, H.M., Gershman, S.J., 2019. Controllability governs the balance between Pavlovian and
instrumental action selection. Nat Commun 10, 5826. https://doi.org/10.1038/s41467-019-13737-7
Doya, K., 2002. Metalearning and neuromodulation. Neural Netw 15, 495–506.
Fisher, E.L., Whyte, C.J., Hohwy, J., 2025. An Active Inference Model of the Optimism Bias. Comput
Psychiatr 9, 3–22. https://doi.org/10.5334/cpsy.125
FitzGerald, T.H.B., Dolan, R.J., Friston, K., 2015. Dopamine, reward learning, and active inference.
Front Comput Neurosci 9, 136. https://doi.org/10.3389/fncom.2015.00136
Flagel, S.B., Clark, J.J., Robinson, T.E., Mayo, L., Czuj, A., Willuhn, I., Akers, C.A., Clinton, S.M.,
Phillips, P.E.M., Akil, H., 2011. A selective role for dopamine in reward learning. Nature 469, 53–57.
https://doi.org/10.1038/nature09588
Fleming, S.M., Daw, N.D., 2017. Self-Evaluation of Decision-Making: A General Bayesian Framework
for Metacognitive Computation. Psychol Rev 124, 91–114. https://doi.org/10.1037/rev0000045
Forster, S.E., Carter, C.S., Cohen, J.D., Cho, R.Y., 2011. Parametric manipulation of the conflict signal
and control-state adaptation. J Cogn Neurosci 23, 923–935. https://doi.org/10.1162/jocn.2010.21458
Friston, K., 2010. The free-energy principle: a unified brain theory? Nat Rev Neurosci 11, 127–138.
https://doi.org/10.1038/nrn2787
Friston, K., 2005. A theory of cortical responses. Philos Trans R Soc Lond B Biol Sci 360, 815–836.
https://doi.org/10.1098/rstb.2005.1622
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., O’Doherty, J.P., Pezzulo, G., 2016. Active
inference and learning. Neuroscience & Biobehavioral Reviews 68, 862–879.
https://doi.org/10.1016/j.neubiorev.2016.06.022
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., Pezzulo, G., 2017. Active Inference: A
Process Theory. Neural Computation 29, 1–49. https://doi.org/10.1162/NECO_a_00912
Friston, K., Kilner, J., Harrison, L., 2006. A free energy principle for the brain. J Physiol Paris 100, 70–
87. https://doi.org/10.1016/j.jphysparis.2006.10.001
Friston, K., Samothrakis, S., Montague, R., 2012a. Active inference and agency: optimal control
without cost functions. Biol Cybern 106, 523–541. https://doi.org/10.1007/s00422-012-0512-8
Friston, K., Schwartenbeck, P., Fitzgerald, T., Moutoussis, M., Behrens, T., Dolan, R., 2013. The
anatomy of choice: active inference and agency. Frontiers in Human Neuroscience 7.
Friston, K., Schwartenbeck, P., FitzGerald, T., Moutoussis, M., Behrens, T., Dolan, R.J., 2014. The
anatomy of choice: dopamine and decision-making. Philosophical Transactions of the Royal Society B:
Biological Sciences 369, 20130481. https://doi.org/10.1098/rstb.2013.0481
Friston, K., Shiner, T., FitzGerald, T., Galea, J.M., Adams, R., Brown, H., Dolan, R.J., Moran, R.,
Stephan, K.E., Bestmann, S., 2012b. Dopamine, Affordance and Active Inference. PLoS Comput Biol
8, e1002327. https://doi.org/10.1371/journal.pcbi.1002327
Gabriel, M., Orona, E., 1982. Parallel and serial processes of the prefrontal and cingulate cortical
systems during behavioral learning. Brain Research Bulletin 8, 781–785. https://doi.org/10.1016/0361-
9230(82)90107-1
Gershman, S.J., Horvitz, E.J., Tenenbaum, J.B., 2015. Computational rationality: A converging
paradigm for intelligence in brains, minds, and machines. Science 349, 273–278.
https://doi.org/10.1126/science.aac6076
Grahek, I., Musslick, S., Shenhav, A., 2020. A computational perspective on the roles of affect in
cognitive control. International Journal of Psychophysiology 151, 25–34.
https://doi.org/10.1016/j.ijpsycho.2020.02.001
Gratton, G., Coles, M.G.H., Donchin, E., 1992. Optimizing the use of information: Strategic control of
activation of responses. Journal of Experimental Psychology: General 121, 480–506.
https://doi.org/10.1037/0096-3445.121.4.480
Green, L., Myerson, J., 2004. A Discounting Framework for Choice With Delayed and Probabilistic
Rewards. Psychol Bull 130, 769–792. https://doi.org/10.1037/0033-2909.130.5.769
Hayden, B.Y., 2018. Why has evolution not selected for perfect self-control? Philosophical
Transactions of the Royal Society B: Biological Sciences 374, 20180139.
https://doi.org/10.1098/rstb.2018.0139
Heatherton, T.F., Wagner, D.D., 2011. Cognitive Neuroscience of Self-Regulation Failure. Trends
Cogn Sci 15, 132–139. https://doi.org/10.1016/j.tics.2010.12.005
Hennig, J.A., Oby, E.R., Golub, M.D., Bahureksa, L.A., Sadtler, P.T., Quick, K.M., Ryu, S.I., Tyler-
Kabara, E.C., Batista, A.P., Chase, S.M., Yu, B.M., 2021. Learning is shaped by abrupt changes in
neural engagement. Nat Neurosci 24, 727–736. https://doi.org/10.1038/s41593-021-00822-8
Hesp, C., Smith, R., Parr, T., Allen, M., Friston, K.J., Ramstead, M.J.D., 2021. Deeply Felt Affect: The
Emergence of Valence in Deep Active Inference. Neural Comput 33, 398–446.
https://doi.org/10.1162/neco_a_01341
Hofmann, W., Schmeichel, B.J., Baddeley, A.D., 2012. Executive functions and self-regulation. Trends
Cogn Sci 16, 174–180. https://doi.org/10.1016/j.tics.2012.01.006
Hommel, B., 2019. Binary Theorizing Does Not Account for Action Control. Front. Psychol. 10.
https://doi.org/10.3389/fpsyg.2019.02542
Iodice, P., Ferrante, C., Brunetti, L., Cabib, S., Protasi, F., Walton, M.E., Pezzulo, G., 2017. Fatigue
modulates dopamine availability and promotes flexible choice reversals during decision making.
Scientific Reports 7, 535.
Isomura, T., 2021. Active inference leads to Bayesian neurophysiology. Neuroscience Research.
https://doi.org/10.1016/j.neures.2021.12.003
Izawa, J., Rane, T., Donchin, O., Shadmehr, R., 2008. Motor Adaptation as a Process of Reoptimization.
J. Neurosci. 28, 2883–2891. https://doi.org/10.1523/JNEUROSCI.5359-07.2008
Jimura, K., Locke, H.S., Braver, T.S., 2010. Prefrontal cortex mediation of cognitive enhancement in
rewarding motivational contexts. Proc Natl Acad Sci U S A 107, 8871–8876.
https://doi.org/10.1073/pnas.1002007107
Jordan, R., Keller, G.B., 2023. The locus coeruleus broadcasts prediction errors across the cortex to
promote sensorimotor plasticity. eLife 12, RP85111. https://doi.org/10.7554/eLife.85111
Kahneman, D., Treisman, A., Burkell, J., 1983. The cost of visual filtering. Journal of Experimental
Psychology: Human Perception and Performance 9, 510–522. https://doi.org/10.1037/0096-
1523.9.4.510
Keramati, M., Dezfouli, A., Piray, P., 2011. Speed/accuracy trade-off between the habitual and the goal-
directed processes. PLoS Comput Biol 7, e1002055. https://doi.org/10.1371/journal.pcbi.1002055
Kerns, J.G., 2006. Anterior cingulate and prefrontal cortex activity in an FMRI study of trial-to-trial
adjustments on the Simon task. Neuroimage 33, 399–405.
https://doi.org/10.1016/j.neuroimage.2006.06.012
Kerns, J.G., Cohen, J.D., MacDonald, A.W., Cho, R.Y., Stenger, V.A., Carter, C.S., 2004. Anterior
cingulate conflict monitoring and adjustments in control. Science 303, 1023–1026.
https://doi.org/10.1126/science.1089910
Kiesel, A., Steinhauser, M., Wendt, M., Falkenstein, M., Jost, K., Philipp, A.M., Koch, I., 2010. Control
and interference in task switching--a review. Psychol Bull 136, 849–874.
https://doi.org/10.1037/a0019842
King, J.A., Korb, F.M., von Cramon, D.Y., Ullsperger, M., 2010. Post-error behavioral adjustments are
facilitated by activation and suppression of task-relevant and task-irrelevant information processing. J
Neurosci 30, 12759–12769. https://doi.org/10.1523/JNEUROSCI.3274-10.2010
Koechlin, E., Ody, C., Kouneiher, F., 2003. The architecture of cognitive control in the human
prefrontal cortex. Science 302, 1181–1185. https://doi.org/10.1126/science.1088545
Koechlin, E., Summerfield, C., 2007. An information theoretical approach to prefrontal executive
function. Trends Cogn Sci 11, 229–35. https://doi.org/10.1016/j.tics.2007.04.005
Kool, W., Botvinick, M., 2014. A labor/leisure tradeoff in cognitive control. J Exp Psychol Gen 143,
131–141. https://doi.org/10.1037/a0031048
Kool, W., McGuire, J.T., Rosen, Z.B., Botvinick, M.M., 2010. Decision making and the avoidance of
cognitive demand. J Exp Psychol Gen 139, 665–682. https://doi.org/10.1037/a0020198
Kumar, M., Goldstein, A., Michelmann, S., Zacks, J.M., Hasson, U., Norman, K.A., 2023. Bayesian
Surprise Predicts Human Event Segmentation in Story Listening. Cognitive Science 47, e13343.
https://doi.org/10.1111/cogs.13343
Laming, D.R.J., 1968. Information theory of choice-reaction times. Academic Press, Oxford, England.
Langdon, A.J., Sharpe, M.J., Schoenbaum, G., Niv, Y., 2018. Model-based predictions for dopamine.
Current Opinion in Neurobiology, Neurobiology of Behavior 49, 1–7.
https://doi.org/10.1016/j.conb.2017.10.006
Lanillos, P., Meo, C., Pezzato, C., Meera, A.A., Baioumy, M., Ohata, W., Tschantz, A., Millidge, B.,
Wisse, M., Buckley, C.L., Tani, J., 2021. Active Inference in Robotics and Artificial Agents: Survey
and Challenges. https://doi.org/10.48550/arXiv.2112.01871
LeDoux, J., Daw, N.D., 2018. Surviving threats: neural circuit and computational implications of a new
taxonomy of defensive behaviour. Nat Rev Neurosci 19, 269–282. https://doi.org/10.1038/nrn.2018.22
Lee, S.W., Shimojo, S., O’Doherty, J.P., 2014. Neural computations underlying arbitration between
model-based and model-free learning. Neuron 81, 687–699.
https://doi.org/10.1016/j.neuron.2013.11.028
Luna, B., Marek, S., Larsen, B., Tervo-Clemmens, B., Chahal, R., 2015. An Integrative Model of the
Maturation of Cognitive Control. Annu Rev Neurosci 38, 151–170. https://doi.org/10.1146/annurev-
neuro-071714-034054
MacDonald, A.W., Cohen, J.D., Stenger, V.A., Carter, C.S., 2000. Dissociating the role of the
dorsolateral prefrontal and anterior cingulate cortex in cognitive control. Science 288, 1835–1838.
https://doi.org/10.1126/science.288.5472.1835
MacLeod, C.M., 1991. Half a century of research on the Stroop effect: An integrative review.
Psychological Bulletin 109, 163–203. https://doi.org/10.1037/0033-2909.109.2.163
MacLeod, C.M., Dunbar, K., 1988. Training and Stroop-like interference: evidence for a continuum of
automaticity. Journal of Experimental Psychology: Learning, memory, and cognition 14, 126.
Maisto, D., Friston, K., Pezzulo, G., 2019. Caching mechanisms for habit formation in Active Inference.
Neurocomputing 359, 298–314. https://doi.org/10.1016/j.neucom.2019.05.083
McClelland, J.L., Rumelhart, D.E., 1981. An interactive activation model of context effects in letter
perception: I. An account of basic findings. Psychological Review 88, 375–407.
https://doi.org/10.1037/0033-295X.88.5.375
Miller, K.J., Ludvig, E.A., Pezzulo, G., Shenhav, A., 2018. Chapter 18 - Realigning Models of Habitual
and Goal-Directed Decision-Making, in: Morris, R., Bornstein, A., Shenhav, A. (Eds.), Goal-Directed
Decision Making. Academic Press, pp. 407–428. https://doi.org/10.1016/B978-0-12-812098-9.00018-
8
Milli, S., Lieder, F., Griffiths, T.L., 2021. A rational reinterpretation of dual-process theories. Cognition
217, 104881. https://doi.org/10.1016/j.cognition.2021.104881
Mittenbühler, M., Schwöbel, S., Dignath, D., Kiebel, S., Butz, M.V., 2024. A Rational Trade-Off
Between the Costs and Benefits of Automatic and Controlled Processing. Proceedings of the Annual
Meeting of the Cognitive Science Society 46.
Monsell, S., 2003. Task switching. Trends Cogn Sci 7, 134–140. https://doi.org/10.1016/s1364-
6613(03)00028-7
Montague, P., Dayan, P., Sejnowski, T., 1996. A framework for mesencephalic dopamine systems
based on predictive Hebbian learning. J. Neurosci. 16, 1936–1947.
https://doi.org/10.1523/JNEUROSCI.16-05-01936.1996
Moors, A., De Houwer, J., 2006. Automaticity: A Theoretical and Conceptual Analysis. Psychological
Bulletin 132, 297–326. https://doi.org/10.1037/0033-2909.132.2.297
Musslick, S., Shenhav, A., Botvinick, M., Cohen, J., 2015. A Computational Model of Control
Allocation based on the Expected Value of Control.
Nee, D.E., Wager, T.D., Jonides, J., 2007. Interference resolution: Insights from a meta-analysis of
neuroimaging tasks. Cognitive, Affective, & Behavioral Neuroscience 7, 1–17.
https://doi.org/10.3758/CABN.7.1.1
De Neys, W., 2023. Advancing theorizing about fast-and-slow thinking. Behavioral and Brain Sciences
46, e111. https://doi.org/10.1017/S0140525X2200142X
Norman, D.A., Shallice, T., 1986. Attention to Action, in: Davidson, R.J., Schwartz, G.E., Shapiro, D.
(Eds.), Consciousness and Self-Regulation: Advances in Research and Theory Volume 4. Springer US,
Boston, MA, pp. 1–18. https://doi.org/10.1007/978-1-4757-0629-1_1
Padmala, S., Pessoa, L., 2011. Reward Reduces Conflict by Enhancing Attentional Control and Biasing
Visual Cortical Processing. Journal of Cognitive Neuroscience 23, 3419–3432.
https://doi.org/10.1162/jocn_a_00011
Parr, T., Holmes, E., Friston, K.J., Pezzulo, G., 2023. Cognitive effort and active inference.
Neuropsychologia 184, 108562. https://doi.org/10.1016/j.neuropsychologia.2023.108562
Parr, T., Pezzulo, G., Friston, K., 2022. Active Inference: The Free Energy Principle in Mind, Brain,
and Behavior. MIT Press, Cambridge, MA, USA.
Paus, T., Petrides, M., Evans, A.C., Meyer, E., 1993. Role of the human anterior cingulate cortex in the
control of oculomotor, manual, and speech responses: a positron emission tomography study. Journal
of Neurophysiology 70, 453–469. https://doi.org/10.1152/jn.1993.70.2.453
Peters, J., Büchel, C., 2010. Episodic future thinking reduces reward delay discounting through an
enhancement of prefrontal-mediotemporal interactions. Neuron 66, 138–148.
https://doi.org/10.1016/j.neuron.2010.03.026
Pezzulo, G., 2012. An Active Inference view of cognitive control. Frontiers in Psychology 3, 478.
https://doi.org/10.3389/fpsyg.2012.00478
Pezzulo, G., Parr, T., Friston, K., 2024. Active inference as a theory of sentient behavior. Biological
Psychology 186, 108741. https://doi.org/10.1016/j.biopsycho.2023.108741
Pezzulo, G., Rigoli, F., 2011. The value of foresight: how prospection affects decision-making. Front
Neurosci 5, 79. https://doi.org/10.3389/fnins.2011.00079
Pezzulo, G., Rigoli, F., Chersi, F., 2013. The Mixed Instrumental Controller: Using Value of
Information to Combine Habitual Choice and Mental Simulation. Frontiers in Psychology 4, 92.
https://doi.org/10.3389/fpsyg.2013.00092
Pezzulo, G., Rigoli, F., Friston, K., 2015. Active Inference, homeostatic regulation and adaptive
behavioural control. Progress in Neurobiology 134, 17–35.
https://doi.org/10.1016/j.pneurobio.2015.09.001
Pezzulo, G., Rigoli, F., Friston, K.J., 2018. Hierarchical Active Inference: A Theory of Motivated
Control. Trends in Cognitive Sciences 22, 294–306. https://doi.org/10.1016/j.tics.2018.01.009
Pio-Lopez, L., Nizard, A., Friston, K., Pezzulo, G., 2016. Active inference and robot control: a case
study. J. R. Soc. Interface. 13, 20160616. https://doi.org/10.1098/rsif.2016.0616
Posner, M.I., 1978. Chronometric explorations of mind. Lawrence Erlbaum, Oxford, England.
Rabbitt, P.M., 1966. Errors and error correction in choice-response tasks. Journal of Experimental
Psychology 71, 264–272. https://doi.org/10.1037/h0022853
Rubin, J., Shamir, O., Tishby, N., 2012. Trading value and information in MDPs, in: Decision Making
with Imperfect Decision Makers. Springer, pp. 57–74.
Rubinstein, J.S., Meyer, D.E., Evans, J.E., 2001. Executive control of cognitive processes in task
switching. J Exp Psychol Hum Percept Perform 27, 763–797. https://doi.org/10.1037//0096-
1523.27.4.763
Rumelhart, D.E., Norman, D.A., 1982. Simulating a Skilled Typist: A Study of Skilled Cognitive-Motor
Performance. Cognitive Science 6, 1–36. https://doi.org/10.1207/s15516709cog0601_1
Rumiati, R.I., Tessari, A., 2002. Imitation of novel and well-known actions: the role of short-term
memory. Exp Brain Res 142, 425–433. https://doi.org/10.1007/s00221-001-0956-x
Salamone, J.D., Pardo, M., Yohn, S.E., López-Cruz, L., SanMiguel, N., Correa, M., 2016a. Mesolimbic
Dopamine and the Regulation of Motivated Behavior. Curr Top Behav Neurosci 27, 231–257.
https://doi.org/10.1007/7854_2015_383
Salamone, J.D., Yohn, S.E., López-Cruz, L., San Miguel, N., Correa, M., 2016b. Activational and
effort-related aspects of motivation: neural mechanisms and implications for psychopathology. Brain
139, 1325–1347. https://doi.org/10.1093/brain/aww050
Sales, A.C., Friston, K.J., Jones, M.W., Pickering, A.E., Moran, R.J., 2019. Locus Coeruleus tracking
of prediction errors optimises cognitive flexibility: An Active Inference model. PLOS Computational
Biology 15, e1006267. https://doi.org/10.1371/journal.pcbi.1006267
Sawaguchi, T., Goldman-Rakic, P.S., 1991. D1 dopamine receptors in prefrontal cortex: involvement
in working memory. Science 251, 947–950. https://doi.org/10.1126/science.1825731
Schneider, W., Chein, J.M., 2003. Controlled & automatic processing: behavior, theory, and biological
mechanisms. Cognitive Science 27, 525–559. https://doi.org/10.1207/s15516709cog2703_8
Schultz, W., 1998. Predictive Reward Signal of Dopamine Neurons. Journal of Neurophysiology 80,
1–27. https://doi.org/10.1152/jn.1998.80.1.1
Schultz, W., Dayan, P., Montague, P.R., 1997. A neural substrate of prediction and reward. Science
275, 1593–1599. https://doi.org/10.1126/science.275.5306.1593
Schwartenbeck, P., FitzGerald, T.H.B., Mathys, C., Dolan, R., Friston, K., 2015a. The Dopaminergic
Midbrain Encodes the Expected Certainty about Desired Outcomes. Cereb. Cortex 25, 3434–3445.
https://doi.org/10.1093/cercor/bhu159
Schwartenbeck, P., FitzGerald, T.H.B., Mathys, C., Dolan, R., Kronbichler, M., Friston, K., 2015b.
Evidence for surprise minimization over value maximization in choice behavior. Sci Rep 5, 16575.
https://doi.org/10.1038/srep16575
Schwöbel, S., Marković, D., Smolka, M.N., Kiebel, S.J., 2021. Balancing control: A Bayesian
interpretation of habitual and goal-directed behavior. Journal of Mathematical Psychology 100, 102472.
https://doi.org/10.1016/j.jmp.2020.102472
Shallice, T., Burgess, P., 1993. Supervisory control of action and thought selection, in: Attention:
Selection, Awareness, and Control: A Tribute to Donald Broadbent. Clarendon Press/Oxford
University Press, New York, NY, US, pp. 171–187.
Sharot, T., Guitart-Masip, M., Korn, C.W., Chowdhury, R., Dolan, R.J., 2012. How dopamine enhances
an optimism bias in humans. Curr Biol 22, 1477–1481. https://doi.org/10.1016/j.cub.2012.05.053
Shenhav, A., Botvinick, M.M., Cohen, J.D., 2013. The expected value of control: An integrative theory
of anterior cingulate cortex function. Neuron 79, 217–240.
https://doi.org/10.1016/j.neuron.2013.07.007
Shiffrin, R.M., Schneider, W., 1977. Controlled and automatic human information processing: II.
Perceptual learning, automatic attending and a general theory. Psychological Review 84, 127–190.
https://doi.org/10.1037/0033-295X.84.2.127
Shine, J.M., 2023. Neuromodulatory control of complex adaptive dynamics in the brain. Interface Focus
13, 20220079. https://doi.org/10.1098/rsfs.2022.0079
Silton, R.L., Heller, W., Towers, D.N., Engels, A.S., Spielberg, J.M., Edgar, J.C., Sass, S.M., Stewart,
J.L., Sutton, B.P., Banich, M.T., Miller, G.A., 2010. The time course of activity in dorsolateral
prefrontal cortex and anterior cingulate cortex during top-down attentional control. NeuroImage 50,
1292–1302. https://doi.org/10.1016/j.neuroimage.2009.12.061
Silvetti, M., Lasaponara, S., Daddaoua, N., Horan, M., Gottlieb, J., 2023. A Reinforcement Meta-
Learning framework of executive function and information demand. Neural Networks 157, 103–113.
https://doi.org/10.1016/j.neunet.2022.10.004
Silvetti, M., Seurinck, R., Verguts, T., 2011. Value and Prediction Error in Medial Frontal Cortex:
Integrating the Single-Unit and Systems Levels of Analysis. Front Hum Neurosci 5, 75.
https://doi.org/10.3389/fnhum.2011.00075
Silvetti, M., Vassena, E., Abrahamse, E., Verguts, T., 2018. Dorsal anterior cingulate-brainstem
ensemble as a reinforcement meta-learner. PLOS Computational Biology 14, e1006370.
https://doi.org/10.1371/journal.pcbi.1006370
Smith, R., Schwartenbeck, P., Stewart, J.L., Kuplicki, R., Ekhtiari, H., Paulus, M.P., 2020. Imprecise
action selection in substance use disorder: Evidence for active learning impairments when solving the
explore-exploit dilemma. Drug and Alcohol Dependence 215, 108208.
https://doi.org/10.1016/j.drugalcdep.2020.108208
Sohn, M.-H., Albert, M.V., Jung, K., Carter, C.S., Anderson, J.R., 2007. Anticipation of conflict
monitoring in the anterior cingulate cortex and the prefrontal cortex. Proceedings of the National
Academy of Sciences 104, 10330–10334. https://doi.org/10.1073/pnas.0703225104
Sozou, P.D., 1998. On hyperbolic discounting and uncertain hazard rates. Proceedings of the Royal
Society of London. Series B: Biological Sciences 265, 2015–2020.
https://doi.org/10.1098/rspb.1998.0534
Stanovich, K.E., West, R.F., 2000. Individual differences in reasoning: Implications for the rationality
debate? Behavioral and Brain Sciences 23, 645–665. https://doi.org/10.1017/S0140525X00003435
Sutton, R.S., Barto, A.G., 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
Swart, J.C., Froböse, M.I., Cook, J.L., Geurts, D.E., Frank, M.J., Cools, R., den Ouden, H.E., 2017.
Catecholaminergic challenge uncovers distinct Pavlovian and instrumental mechanisms of motivated
(in)action. eLife 6, e22169. https://doi.org/10.7554/eLife.22169
Taatgen, N.A., Lee, F.J., 2003. Production Compilation: A Simple Mechanism to Model Complex Skill
Acquisition. Hum Factors 45, 61–76. https://doi.org/10.1518/hfes.45.1.61.27224
Taniguchi, T., Murata, S., Suzuki, M., Ognibene, D., Lanillos, P., Ugur, E., Jamone, L., Nakamura, T.,
Ciria, A., Lara, B., Pezzulo, G., 2023. World Models and Predictive Coding for Cognitive and
Developmental Robotics: Frontiers and Challenges. Advanced Robotics 37.
https://doi.org/10.48550/arXiv.2301.05832
Tessari, A., Bosanac, D., Rumiati, R.I., 2006. Effect of learning on imitation of new actions:
implications for a memory model. Exp Brain Res 173, 507–513. https://doi.org/10.1007/s00221-006-
0395-9
Tessari, A., Mengotti, P., Faccioli, L., Tuozzi, G., Boscarato, S., Taricco, M., Rumiati, R.I., 2021. Effect
of body-part specificity and meaning in gesture imitation in left hemisphere stroke patients.
Neuropsychologia 151, 107720. https://doi.org/10.1016/j.neuropsychologia.2020.107720
Tobler, P.N., Fiorillo, C.D., Schultz, W., 2005. Adaptive Coding of Reward Value by Dopamine
Neurons. Science 307, 1642–1645. https://doi.org/10.1126/science.1105370
Todorov, E., 2009. Efficient computation of optimal actions. Proc Natl Acad Sci U S A 106, 11478–
11483. https://doi.org/10.1073/pnas.0710743106
Vassena, E., Deraeve, J., Alexander, W.H., 2020. Surprise, value and control in anterior cingulate cortex
during speeded decision-making. Nat Hum Behav 4, 412–422. https://doi.org/10.1038/s41562-019-
0801-5
Vriens, T., Vassena, E., Pezzulo, G., Baldassarre, G., Silvetti, M., 2025. Meta-Reinforcement Learning
reconciles surprise, value, and control in the anterior cingulate cortex. PLOS Computational Biology
21, e1013025. https://doi.org/10.1371/journal.pcbi.1013025
Wang, J.X., Kurth-Nelson, Z., Kumaran, D., Tirumala, D., Soyer, H., Leibo, J.Z., Hassabis, D.,
Botvinick, M., 2018. Prefrontal cortex as a meta-reinforcement learning system. Nat Neurosci 21, 860–
868. https://doi.org/10.1038/s41593-018-0147-8
Wei, K., Körding, K., 2009. Relevance of Error: What Drives Motor Adaptation? Journal of
Neurophysiology 101, 655–664. https://doi.org/10.1152/jn.90545.2008
Westbrook, A., Braver, T.S., 2016. Dopamine Does Double Duty in Motivating Cognitive Effort.
Neuron 89, 695–710. https://doi.org/10.1016/j.neuron.2015.12.029
Westbrook, A., van den Bosch, R., Määttä, J.I., Hofmans, L., Papadopetraki, D., Cools, R., Frank, M.J.,
2020. Dopamine promotes cognitive effort by biasing the benefits versus costs of cognitive work.
Science 367, 1362–1366. https://doi.org/10.1126/science.aaz5891
Zénon, A., Solopchuk, O., Pezzulo, G., 2019. An information-theoretic perspective on the costs of
cognition. Neuropsychologia 123, 5–18. https://doi.org/10.1016/j.neuropsychologia.2018.09.013