0% found this document useful (0 votes)
6 views16 pages

1 s2.0 S0167739X24001262 Main

Uploaded by

sherwingao99
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views16 pages

1 s2.0 S0167739X24001262 Main

Uploaded by

sherwingao99
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

Future Generation Computer Systems 157 (2024) 469–484

Contents lists available at ScienceDirect

Future Generation Computer Systems


journal homepage: www.elsevier.com/locate/fgcs

Potential-based reward shaping using state–space segmentation for


efficiency in reinforcement learning
Melis İlayda Bal a , Hüseyin Aydın b,c , Cem İyigün d , Faruk Polat b ,∗
a Max Planck Institute for Intelligent Systems, Tübingen, 72076, Germany
b Department of Computer Engineering, Middle East Technical University, Ankara, 06800, Turkey
c
Department of Information and Computing Science, Utrecht University, Utrecht, 3584 CE, Netherlands
d
Department of Industrial Engineering, Middle East Technical University, Ankara, 06800, Turkey

ARTICLE INFO ABSTRACT

Keywords: Reinforcement Learning (RL) algorithms encounter slow learning in environments with sparse explicit reward
Reinforcement learning structures due to the limited feedback available on the agent’s behavior. This problem is exacerbated
State–space segmentation particularly in complex tasks with large state and action spaces. To address this inefficiency, in this paper, we
Potential-based reward shaping
propose a novel approach based on potential-based reward-shaping using state–space segmentation to decompose
Reward shaping
the task and to provide more frequent feedback to the agent. Our approach involves extracting state–space
Sparse rewards
segments by formulating the problem as a minimum cut problem on a transition graph, constructed using
the agent’s experiences during interactions with the environment via the Extended Segmented Q-Cut algorithm.
Subsequently, these segments are leveraged in the agent’s learning process through potential-based reward
shaping. Our experimentation on benchmark problem domains with sparse rewards demonstrated that our
proposed method effectively accelerates the agent’s learning without compromising computation time while
upholding the policy invariance principle.

1. Introduction the slow learning problem in sparse reward environments. By using


an existing subgoal identification method, namely Segmented Q-Cut,
Sparse explicit reward structures present a significant challenge for our method presents a graph-theoretical way to segment the state–
Reinforcement Learning (RL) agents. The lack of immediate feedback space using the RL agent’s transition history in the environment and
on the behavior of the learning agent combined with the underlying provides a novel way to benefit from extracted segment information in
delayed nature of the feedback results in an RL agent suffering from
the agent’s learning process through a potential-based reward shaping
slow learning. Therefore, the RL agent performs poorly, particularly
mechanism.
in real-world tasks having sparse rewards and large state and action
While our approach provides a generic framework for segmenting
spaces.
To accelerate the RL agent’s learning in such settings, common ap- the agent’s state space in the given problem, we specifically adopt
proaches in the literature center around task decomposition or reward- Segmented Q-Cut as our base method. This choice is driven by its
based methods to make reward function denser. Studies that aim to recursive nature, which allows us to identify potentially multiple sub-
speed up learning with task decomposition, divide the complex RL goals. However, our perspective significantly diverges from the study
problem into simpler sub-problems by identifying relatively important conducted by Menache et al. [1], as our objective is to expedite learning
states, labeling them as bottlenecks or subgoals [1] and then learning by offering dense feedback to the agent through a potential-based
macro-policies generated with the options framework to reach those reward shaping mechanism. This mechanism relies on the identified
important states [2]. On the other hand, reward-based methods mostly state-space segmentation, rather than generating macro-policies based
focus on reward shaping that provides additional rewards to the agent on subgoal identification.
to alleviate the sparse and delayed feedback nature of the problem to The experimental results in benchmark sparse reward environments
improve the learning speed [3–5].
showed that our method improves the RL agent’s learning speed no-
In this paper, we propose a reward shaping approach called poten-
tably compared to the baseline Q-Learning agent without needing
tial-based reward shaping using state–space segmentation to address

∗ Corresponding author.
E-mail addresses: [email protected] (M.İ. Bal), [email protected] (H. Aydın), [email protected] (C. İyigün), [email protected]
(F. Polat).

https://doi.org/10.1016/j.future.2024.03.057
Received 5 October 2023; Received in revised form 27 March 2024; Accepted 31 March 2024
Available online 7 April 2024
0167-739X/© 2024 Elsevier B.V. All rights reserved.
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

to sacrifice computation time while preserving the policy invariance et al. propose a landmark-based reward shaping to advance learning
principle. speed and quality [4]. They define the potentials in terms of the value
of landmarks, however, they focus on RL problems with hidden states.
Contributions. We make the following main contributions:
Conversely, our approach works on fully observable RL problems and
uses potentials in terms of the Q-value of the state–space segments.
• We propose a novel potential-based reward-shaping mechanism
to improve the efficiency of the RL algorithm in sparse reward Subgoal discovery. To accelerate learning, studies based on task decom-
environments, by using a potential function defined via a graph- position leverage the learning of macro policies defined over identified
theoretic state–space segmentation strategy. critical states, referred to as subgoals [2,8,22–24]. While some of these
• We extend the Segmented Q-Cut algorithm [1] to identify space methods use graph theoretical approaches as we do, some of them are
segments using a minimum cut problem formulation on a transi- based on the statistical analysis of agent transitions. Nevertheless, our
tion graph constructed by the agent in an online manner i.e. dur- study diverges from all these methods as it does not necessitate the
ing its interaction with the environment. additional learning of such macro policies.
• We experimentally validate the performance of the proposed re-
ward shaping mechanism on several benchmark sparse reward RL 3. Background
domains. The results show that our reward-shaping formulation
indeed improves sample efficiency while preserving the policy 3.1. Reinforcement learning
invariance.
A sequential decision-making process under uncertainty is the clas-
The rest of the paper is organized as follows. Section 2 provides the sical representation of a Reinforcement Learning problem in which
existing studies considered as related work. The necessary background actions taken by an agent in a sequence are associated with uncertain
for a better understanding of the remaining material is given in Sec- outcomes. Formally, an RL problem is generally modeled with a Markov
tion 3. Section 4 covers the proposed method in detail. Experimental Decision Process (MDP) assuming that the Markov property holds [25].
results are presented and discussed in Section 5. Finally, Section 6 A Markov Decision Process (MDP) is described with the tuple
wraps up with the conclusions and possible future research directions. ⟨S, A, T , R , 𝛾⟩ where S: a finite set of states, A: a finite set of actions,
T ∶ S × A × S → [0, 1]: a transition function T which maps the state–
2. Related work action pairs to a probability distribution over states, R ∶ S × A → R:
a reward function R which provides immediate reward after an action
Sparse rewards. Sparse reward scenarios, where the agent encounters choice in some state ∈ S, and 𝛾 ∈ [0, 1]: a discount factor that shows
rewards infrequently or not at all for the majority of its actions, present the importance of future rewards to the current state.
a formidable challenge for reinforcement learning algorithms. To ad- The RL agent’s interaction with the environment starts with observ-
dress this issue, studies have employed diverse strategies, including ex- ing its state information 𝑠𝑡 ∈ S at the decision step 𝑡 ∈ 𝑇 where 𝑇 can
ploration techniques [6], reward-based approaches [3–5], task decom- be finite in episodic or ∞ for continuing tasks. Based on the observed
position [1], imitation learning [7,8], and expert demonstrations [9]. state, the agent chooses an action 𝑎𝑡 from the admissible action set A.
One prominent avenue of research involves reward shaping [10], which At each time step, the agent chooses actions according to its policy 𝜋,
involves modifying the reward signal to provide additional feedback defined as 𝜋 ∶ S × A → [0, 1] the probability distribution over state–
to the agent. Various forms of additional reward signals are used in action pairs. Whether the agent has chosen a good or bad action is
reward shaping mechanisms such as exploration bonuses [5,6], belief- reflected with an immediate reward 𝑟𝑡 (𝑠𝑡 , 𝑎𝑡 ) ∈ R based on the reward
based signals [11], or intrinsic motivation-based signals depending on function R (𝑠𝑡 , 𝑎𝑡 ). After receiving feedback from the environment, the
state novelty [12] or curiosity [13]. Our proposed approach is under the agent then transitions to a new state 𝑠𝑡+1 ∈ S according to the
line of reward-based strategies, however, it leverages the identification transition function T (𝑠𝑡 , 𝑎𝑡 , 𝑠𝑡+1 ). This interaction process between the
of subgoals and, hence, state–space segments. agent and the environment continues until the agent learns to perform
the RL task successfully. Therefore, after each feedback signal from the
Potential-based reward shaping. A central line of work, potential-based
environment, the agent adjusts its behavior with the goal of extracting
reward shaping (PBRS), is introduced in [14], where an artificial poten-
the best action sequences that enable it to master the RL task.
tial function is used to encourage the agent towards desirable states.
The goal of the RL agent is formally defined as maximizing its
However, careful design is crucial, as poorly crafted reward shaping
long-term return 𝐺̃ 𝑡 with
can introduce biases or suboptimal behavior [15]. Okudo and Yamada
introduce a PBRS approach in which the potentials are defined in terms ∑
𝑇

of subgoal achievements [3]. However, the agent acquires subgoals 𝐺̃ 𝑡 ≐ 𝑟𝑡+1 + 𝛾𝑟𝑡+2 + 𝛾 2 𝑟𝑡+3 + ⋯ ≐ 𝛾 𝑘−𝑡−1 𝑟𝑘 = 𝑟𝑡+1 + 𝛾 𝐺̃ 𝑡+1 , (1)
𝑘=𝑡+1
from human participants instead of automatic identification. [9] pro-
poses a potential function using expert demonstrations. They assign where 𝑇 shows the time at which the episode ends. If the problem is
a potential to the state action pairs according to their similarity to non-episodic, 𝑇 can be replaced with ∞ in this formula [26].
those seen in the demonstrations. A recent study in [16] provides The value 𝑉𝜋 (𝑠) of a state 𝑠 under a policy 𝜋 defines the expected
the agent potential-based language reward which is derived from an return starting from state 𝑠 and following policy 𝜋,
action frequency vector, however, their language-based reward func- [ ]
𝑉 𝜋 (𝑠) ≐ E𝜋 𝐺̃ 𝑡 ∣ 𝑠𝑡 = 𝑠 , (2)
tion overlooks past states. Furthermore, a distance-based potential
function derived from the A* search algorithm is introduced for a whereas the action-value 𝑄𝜋 (𝑠, 𝑎) of state–action pair 𝑠, 𝑎 under policy
specific sparse reward planning task [17]. On the other hand, potential 𝜋 defines the expected return starting from state 𝑠, taking action 𝑎 and
functions can also be plan-based [18], or auxiliary reward functions- then following policy 𝜋,
based [19]. Unlike these, our proposed reward shaping mechanism [ ]
𝑄𝜋 (𝑠, 𝑎) ≐ E𝜋 𝐺̃ 𝑡 ∣ 𝑠𝑡 = 𝑠, 𝑎𝑡 = 𝑎 . (3)
uses a graph-theoretical potential function defined over state–space seg-
ments discovered by the agent in an online manner without requiring In well-known strategies, RL agent aims to discover the optimal pol-
any expert or prior knowledge. icy that provides the maximum expected cumulative future discounted
Although utilization of PBRS on RL problems that include a single reward by first estimating value or action-value functions that represent
agent is common, there are also studies applying PBRS on Multi-Agent the values of the policies to search and evaluate the policies in the
Reinforcement Learning settings such as [20,21]. Furthermore, Demir policy space.

470
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

Temporal-difference (TD) learning is the central idea of most RL 3.3. Potential-based reward shaping
algorithms that stands on the incremental computation of value esti-
mations. TD approaches differ from Monte Carlo methods as they boot- Potential-based reward shaping (PBRS) introduced by Ng et al. [14]
strap i.e. they update the value estimates based on the learned value is a way to shape rewards to deal with the sparse reward function
estimates of successor state–action pairs with the following update rule R while preserving policy invariance. The dense reward function R ′
is obtained with additional reward signals in the form of potentials
( ) ( ) provided to the agent without changing the optimal policy for the
𝑄𝑡+1 𝑠𝑡 , 𝑎𝑡 ← 𝑄𝑡 𝑠𝑡 , 𝑎𝑡 + 𝛼𝛿𝑡 , (4) underlying MDP, a.k.a. policy invariance principle.
( ) ( ) In PBRS, shaping reward function denoted with 𝐹 , 𝐹 ∶ S × S → R,
where 𝛿𝑡 = 𝑟𝑡+1 + 𝛾𝑄𝑡 𝑠𝑡+1 , 𝑎𝑡+1 − 𝑄𝑡 𝑠𝑡 , 𝑎𝑡 is the one-step TD error for
state–action value after transition to next state 𝑠𝑡+1 and receiving 𝑟𝑡+1 . is defined as the difference between real-valued potential functions
(4) is applied in the Sarsa algorithm which is a well-known on-policy of the successive states for a state transition. The potential of a state
TD control approach extending from standard one-step TD learning, or represented by the function 𝛷, 𝛷 ∶ S → R, expresses some knowledge
TD(0) [27]. The off-policy version of the TD error of the environment as a real-value given state information.
( ) ( ) Formally, the potential-based shaping function 𝐹 (𝑠, 𝑠′ ) for a state
𝛿𝑡 = 𝑟𝑡+1 + 𝛾 max 𝑄𝑡 𝑠𝑡+1 , 𝑎 − 𝑄𝑡 𝑠𝑡 , 𝑎𝑡 (5) transition 𝑠 → 𝑠′ is defined by
𝑎

is used in the Q-Learning [28] algorithm. An extension of one-step TD 𝐹 (𝑠, 𝑠′ ) = 𝛾𝛷(𝑠′ ) − 𝛷(𝑠), (10)
that employs multi-step bootstrapping is n-step TD learning, TD(n), in
where 𝛾 is the discount factor.
which the bootstrapping considers n-step successor state–action pairs
Ng et al. prove that if 𝐹 is a potential-based reward shaping function
as
as defined in (10), then every optimal policy for the modified MDP 𝑀 ′
( ) ( )
𝑄𝑡+𝑛 𝑠𝑡 , 𝑎𝑡 ← 𝑄𝑡+𝑛−1 𝑠𝑡 , 𝑎𝑡 + 𝛼𝛿𝑡∶𝑡+𝑛 , (6) will be an optimal policy in the original MDP 𝑀 and vice versa [14].
( ) Hence, defining 𝐹 in the form given in (10) is sufficient to guarantee
where 𝛿𝑡∶𝑡+𝑛 = 𝐺𝑡∶𝑡+𝑛 − 𝑄𝑡+𝑛−1 𝑠𝑡 , 𝑎𝑡 is the n-step TD error and 𝐺𝑡∶𝑡+𝑛 ≐
( ) the policy invariance. Furthermore, the definition of both the shaping
𝑟𝑡+1 + 𝛾𝑟𝑡+1 + ⋯ + 𝛾 𝑛−1 𝑟𝑡+𝑛 + 𝛾 𝑛 𝑄𝑡+𝑛−1 𝑠𝑡+𝑛 , 𝑎𝑡+𝑛 . Following this, in the and the potential functions are extended by [9] to include actions while
learning settings involving temporally extended actions referred to as still satisfying the policy invariance principle.
macro actions, a.k.a. options, Macro Q-Learning [29,30] with the update
rule 3.4. Segmented Q-cut algorithm
( ) ( ) [ ( ) ( )]
𝑄 𝑠𝑡 , 𝑚 ← 𝑄 𝑠𝑡 , 𝑚 + 𝛼 𝑟 + 𝛾 𝑛 max 𝑄 𝑠𝑡+𝑛 , 𝑎 − 𝑄 𝑠𝑡 , 𝑚 , (7)
𝑎 An expedient strategy to accelerate the learning process in RL
where involves task decomposition. Task-decomposition methods require the
identification of subgoals or bottlenecks to divide the task into smaller
𝑟 = 𝑟𝑡+1 + 𝛾𝑟𝑡+2 + ⋯ + 𝛾 𝑛−1 𝑟𝑡+𝑛 , (8) sub-tasks and learn useful skills to successfully complete each sub-task.
is employed. Here, the approximate action value is updated after a A significant line of work leverages identifying subgoals as cuts on the
multi-step transition from state 𝑠𝑡 to 𝑠𝑡+𝑛 using macro action 𝑚. agent’s transition graph. A transition graph in the RL context, is a special
TD methods are faster due to online learning, but less stable because kind of graph structure that stores the state transitions in an MDP. Let 𝐺
of their sensitivity to initial estimates. be a transition graph defined as 𝐺 = ⟨𝑁, 𝐴⟩. 𝐺 is a capacitated directed
network in which the nodes (𝑁) denote the states whereas the arcs
(𝐴) denote the state transitions. To illustrate, the transition from state
3.2. Reward shaping
𝑠 → 𝑠′ is reflected in the graph 𝐺 with arc (𝑠, 𝑠′ ) ∈ 𝐴. The arc capacity
definition depends on the focused RL task. Previous studies that aim
Without frequent reinforcement signals to update the behavior, the
to find subgoal states mostly use state visitation frequency-based arc
RL agent struggles to discover the optimal policy in sparse and/or
capacities [1]. A minimum cut of the transition graph 𝐺 is the set of
delayed reward environments. To handle this, reward shaping mech-
arcs whose removal from 𝐺 divides it into two disjoint subgraphs and
anisms modify the reward function in a way where denser feedback
minimal with respect to some measure, e.g., the capacity of the cut arcs.
signals are provided to the agent [10].
A fundamental work belonging to this line, Q-Cut, identifies sub-
A reward shaping mechanism transforms the original MDP 𝑀 = goals by solving the Max-Flow Min-Cut problem on the transition graph
⟨S, A, T , R , 𝛾⟩ to a modified MDP 𝑀 ′ = ⟨S, A, T , R ′ , 𝛾⟩ where R ′ = R + 𝐹 of the agent’s experiences in the MDP and finds cuts of the entire state-
is the transformed reward function and 𝐹 ∶ S × A → R represents space [1], as also provided in Algorithm 1. Moreover, the LCut method
the shaping reward function [14,31]. Shaping function 𝐹 encourages focuses on local transition graphs and uses a partitioning algorithm
the agent to improve its knowledge of the environment and adjust its to extract the local cuts [23]. Both studies suggest generating macro-
behavior accordingly. actions in the form of an options framework for agents to learn skills
The shaping reward function 𝐹 can be defined in different forms to reach those subgoals [2].
based on the goal of the shaping. For instance, a shaping function that is Menache et al. also propose the automatic identification of subgoals
defined over state–action pairs may produce a bonus term to encourage (bottlenecks) in a dynamic environment with the Segmented Q-Cut
the agent for the exploration of novel state–action pairs [32]. The agent algorithm [1]. The Segmented Q-Cut is an extension of the Q-Cut
receives a bonus term represented with B𝑡 (𝑠, 𝑎) over its extrinsic reward algorithm, in which the general idea is to apply cuts on the state
signal 𝑟𝑡 (𝑠, 𝑎) after visiting the state–action pair (𝑠, 𝑎) at time step 𝑡. segments recursively. In this manner, the Segmented Q-Cut outperforms
Hence, the final reward value for visiting the (𝑠, 𝑎) pair at time step the Q-Cut algorithm for finding multiple subgoals in the given problem.
𝑡 becomes A subgoal or bottleneck is defined as the border state of strongly
connected areas of the transition history graph of the agent. As specified
𝑟̃𝑡 (𝑠, 𝑎) = 𝑟𝑡 (𝑠, 𝑎) ⊕ B𝑡 (𝑠, 𝑎), (9)
before, the transition graph holds the agent’s experiences as state
where ⊕ is the aggregation operation on extrinsic reward and the bonus transitions in a directed and capacitated graph. The capacity of the arcs
term [33]. in the transition graph is defined with the relative frequency measure.
Moreover, in the potential-based version of reward shaping, the Relative frequency 𝑐𝑟𝑓 is computed by
function 𝐹 is defined using some potential function of states which is 𝑛(𝑖 → 𝑗)
explained in the following subsection in detail. 𝑐𝑟𝑓 = , (11)
𝑛(𝑖)

471
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

Algorithm 1 Q-Cut Algorithm Algorithm 3 Cut Procedure of Segmented Q-Cut Algorithm


1: procedure QCut 1: procedure CutProcedureForSegment
2: while True do 2: require: Segment 𝑁
3: Interact with environment / learn via Macro-𝑄 Learning 3: Extend segment 𝑁 by connectivity testing
4: Save state transition history 4: Translate state transition history of segment 𝑁 to a graph
5: if activating cut conditions are met then representation
6: Choose source and sink (𝑠, 𝑡 ∈ 𝑆) 5: for each 𝑠 ∈ 𝑆(𝑁) do
7: Translate state transition history to a graph 𝐺 6: Perform Min-Cut on the extended segment
8: Find a Minimum Cut partition [𝑁𝑠 , 𝑁𝑡 ] between 𝑠 and 𝑡 (𝑠 as source, choice of 𝑡 is task depended)
9: if the cut’s quality is good (bottlenecks are found) then 7: if the cut’s quality is good (bottlenecks are found) then
10: Learn the option for reaching the bottlenecks from 8: Separate the extended 𝑁 into two segments 𝑁𝑠 and 𝑁𝑡
every state in 𝑁𝑠 , using Experience Replay 9: Learn the option for reaching the bottlenecks from
11: end if every state in 𝑁𝑠 , using Experience Replay
( )
12: end if 10: Save new bottlenecks in 𝑆 𝑁𝑡
13: end while 11: end if
14: end procedure 12: end for
13: end procedure

Algorithm 2 Segmented Q-Cut Algorithm


1: procedure SegmentedQCut
4.1. Problem motivation
2: Create an empty segment 𝑁0
3: Include starting state 𝑠0 in segment 𝑁0
( ) Sparse and/or delayed reward environments have posed significant
4: Include starting state 𝑠0 in 𝑆 𝑁0
5: while True do challenges for RL approaches. While recent advancements in the RL
6: Interact with environment / learn via Macro-𝑄 Learning framework have shown promise in such domains, slow learning remains
7: Save state transition history a persistent issue. This challenge becomes even more pronounced when
8: for each segment 𝑁 do RL is applied to real-world problems characterized by large state and
9: if activating cut conditions are met then action spaces, further exacerbating the required learning time. Effec-
10: perform 𝐶𝑢𝑡𝑃 𝑟𝑜𝑐𝑒𝑑𝑢𝑟𝑒𝐹 𝑜𝑟𝑆𝑒𝑔𝑚𝑒𝑛𝑡(𝑁) tively addressing this problem involves exploring two key directions.
11: end if Firstly, decomposing the complex task into sub-tasks enables parallel
12: end for learning of each simpler sub-task, thereby reducing overall learning
13: end while time. Secondly, leveraging the domain knowledge extracted during the
14: end procedure agent–environment interaction can enhance the agent’s learning pro-
cess. Reward shaping is one of the ways that aims to use the extracted
knowledge to improve the speed of the learning process. Since reward
signals are an essential part of the efficiency of the RL approaches, we
where 𝑛(𝑖 → 𝑗) denotes the number of times transition from state 𝑖 to attack the slow learning problem in terms of rewards. To this end, our
state 𝑗 is occurred and 𝑛(𝑖) shows the number of times state 𝑖 is visited. proposed method introduces a potential-based reward-shaping mech-
Ratio cut bi-partitioning metric measures the quality or significance anism that depends on the segmentation of the state–space computed
of a cut. Let 𝑠 − 𝑡 𝑐𝑢𝑡 be a cut of a connected graph 𝐺 that separates the with the ESegQ-Cut algorithm.
graph into two partitions as source and sink partitions denoted with 𝑁𝑠
and 𝑁𝑡 . The quality of 𝑠 − 𝑡 𝑐𝑢𝑡 is computed by
4.2. Reward shaping based on state–space segmentation with the extended
∣ 𝑁𝑠 ∣∣ 𝑁𝑡 ∣ segmented Q-cut algorithm
𝑞𝑐𝑢𝑡 (𝑠, 𝑡) = , (12)
∣ 𝐴(𝑁𝑠 , 𝑁𝑡 ) ∣
where ∣ 𝑁𝑠 ∣ and ∣ 𝑁𝑡 ∣ are the number of nodes in the partitions 𝑁𝑠 The proposed potential-based reward shaping method relies on
and 𝑁𝑡 of graph 𝐺, respectively, and ∣ 𝐴(𝑁𝑠 , 𝑁𝑡 ) ∣ is the number of arcs the state–space segmentation information extracted by the ESegQ-Cut
connecting both partitions. Based on this, a high-quality or significant algorithm. The aim of this method is to use the shaped rewards in
cut is defined as a cut with a small number of arcs separating balanced the learning process of the RL agent which are computed using a
partitions in the transition graph. potential-based reward shaping function that depends on the segment
The Segmented Q-Cut algorithm extends the idea of Q-Cut by using information of the states. The state–space segmentation is obtained
previously identified subgoals for state–space partitioning and then by translating the experiences of the agent in the environment into
finding additional subgoals from generated partitions. The outline of a transition graph and then applying the ESegQ-Cut algorithm. The
the approach is given in Algorithm 2. In this approach, the cut proce- method is represented by a nested structure, as illustrated in Fig. 1.
dure is not applied only once, but for each created partition as given In particular, the method consists of a Q-Segmenter agent designed
in Algorithm 3. with a nested structure. The Q-Segmenter agent behaves as the learn-
ing component in this model. It interacts with the environment and
4. Proposed approach
approximates the state–action values with a Q-table named 𝑄𝑆𝐸𝐺 by
In this section, we explain the motivation for the learning effi- applying the Q-Learning algorithm [28]. Q-Segmenter agent derives
ciency problem in RL literature (Section 4.1) and present the proposed the additional information on the environment through its Segmenter
method, namely potential-based reward shaping using state–space seg- and Subgoal-Identifier inner components. This additional information
mentation with Extended Segmented Q-Cut (ESegQ-Cut) algorithm which is utilized in the agent’s learning process via a reward-shaping mech-
aims to solve this problem (Section 4.2). While we elaborate on the anism. After each transition, Q-Segmenter agent uses shaped rewards
process of state–space segmentation in Section 4.2.1, we formulate when updating its state–action value estimates instead of directly using
the potential-based reward shaping with identified state–space segment extrinsic environmental reward signals. The agent computes the shaped
information in Section 4.2.2. rewards using the information on the state–space segmentation.

472
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

Fig. 1. The model of the proposed method.

Q-Segmenter agent has a Segmenter component whose main role is episode. Training of Q-Segmenter is completed after the agent performs
to compute the state–space segments using the subgoals identified by a pre-determined number of episodes. The details of the methods are
its inner component called Subgoal-Identifier and then convey the ex- given in the following sections.
tracted segments information to the Q-Segmenter agent. Moreover, the
Segmenter behaves as a bridge between the Q-Segmenter agent and the Algorithm 4 Learning with Potential-based Reward Shaping Using
Subgoal-Identifier component by transferring the transition experience State-Space Segmentation
of the Q-Segmenter denoted as a tuple ⟨𝑠, 𝑎, 𝑠′ ⟩ to the Subgoal-Identifier 1: procedure learnWithESegQCut
component. 2: require: ⟨S, A, T , R ⟩,
The innermost element of this model is the Subgoal-Identifier whose learning rate 𝛼 ∈ (0, 1], exploration rate 𝜀 ∈ (0, 1],
main task is to first translate the transition experiences of the Q- discount factor 𝛾 ∈ (0, 1], step limit L ≥ 1,
Segmenter into a transition graph up to a specific period and then, cut quality threshold 𝑐𝑞 ≥ 1,
extract the subgoals by performing the ESegQ-Cut algorithm on the num. of episodes 𝑀 ≥ 1,
final transition graph. The tasks of this component also include trans- num. of episodes for random walk 𝑀𝑐𝑢𝑡 ≥ 1
ferring the information of identified subgoals to the outer Segmenter 3: 𝑄𝑆𝐸𝐺 (𝑠, 𝑎) ← 0, ∀𝑠 ∈ S, 𝑎 ∈ A
component. 4:  ← [], V ← [] ⊳ Initialize Segments & Segment Values
The pseudocode for the method is given in Algorithm 4. The al- 5: for 𝑒𝑝𝑖𝑠𝑜𝑑𝑒 ← 0 𝐭𝐨 𝑀 do
gorithm takes the necessary inputs for applying the Q-Learning [28] 6: if 𝑒𝑝𝑖𝑠𝑜𝑑𝑒 is 0 then
approach along with the inputs related to extracting state–space seg- 7:  ← randomWalk(L, 𝑀𝑐𝑢𝑡 , 𝑐𝑞 ,  )
ment information such as the number of episodes for performing the 8: if  is not empty then
random walk and cut-quality threshold. The method starts with initial- 9: V ← getSegmentValues()
izing a Q-table 𝑄𝑆𝐸𝐺 for each state–action pair, required lists to hold 10: end if
the segments and the values of these segments ( and V respectively), 11: Initialize 𝑠 ∈ S
and then setting the episode number to zero. At the beginning of 12: while 𝑠 is not terminal or L is not reached do
the initial episode, before the learning of Q-Segmenter agent starts, 13: Choose 𝑎 ← EPSILON-GREEDY(𝑄𝑆𝐸𝐺 , 𝜀)
Q-Segmenter completes a random walk phase in the environment via 14: Take action 𝑎, observe 𝑟, 𝑠′
randomWalk() procedure for the sake of identifying state–space 15: 𝑟̃ ← shapeReward(𝑟,  , V )
segments. The agent obtains the state–space segments after completing 16: 𝛿 ← 𝑟̃ + 𝛾 max𝑎′ 𝑄𝑆𝐸𝐺 (𝑠′ , 𝑎′ ) − 𝑄𝑆𝐸𝐺 (𝑠, 𝑎)
the random walk and computes the values of the segments via the 17: Update 𝑄𝑆𝐸𝐺 (𝑠, 𝑎) ← 𝑄𝑆𝐸𝐺 (𝑠, 𝑎) + 𝛼𝛿
getSegmentValues() procedure if any segment is identified. With 18: 𝑠 ← 𝑠′
the knowledge of segments and their values, Q-Segmenter agent starts 19: end while
the learning phase by applying the Q-Learning algorithm. Compared to 20: V ← updateSegmentValues()
the standard Q-Learning algorithm, Q-Segmenter agent shapes its envi- 21: end if
ronmental reward signal 𝑟 after taking action 𝑎 in state 𝑠. The shaped 22: end for
reward signal is denoted by 𝑟̃ and computed with the shapeReward() 23: return 𝑄𝑆𝐸𝐺
procedure. Q-Segmenter then uses 𝑟̃ in the update rule for its Q-table 24: end procedure
𝑄𝑆𝐸𝐺 . Furthermore, the agent also updates the segment values with
the updateSegmentValues() method after each episode.
The flowchart of the learning process with potential-based reward 4.2.1. State–space segmentation
shaping using state–space segmentation with ESegQ-Cut algorithm is Random walk. In the initial episode of the learning, Q-Segmenter goes
presented in Fig. 2. In summary, the learning process starts with the through a random walk phase in the environment to identify state–
agent’s random walk in the environment for a specific period and space segments. The general idea of this phase is illustrated in Fig. 3.
the accumulation of the transitions on a transition graph. After the The phase starts with the interaction between Q-Segmenter and the
random walk is completed, the ESegQ-Cut algorithm is applied to environment. Through the random walk, the agent generates trajecto-
the transition graph, and the state–space segments are constructed ries, and collected trajectories from all episodes of the random walk
using the identified subgoals and weakly connected components on the phase are transformed into a transition graph. The transition graph
transition graph excluding these subgoals. Following this segmentation is then given to the Extended Segmented Q-Cut algorithm. At this
of the state space, training of Q-Segmenter agent starts. The agent stage, first Segmented Q-Cut algorithm is applied and subgoal states
interacts with the environment and shapes the environmental reward are determined in the graph, then using these subgoals, the segments
signal using the extracted knowledge on state–space segments. It then of the state–space are extracted. The random walk phase is finalized by
updates its Q-value estimates with the Q-Learning update rule. At the returning the identified segment information to the learning phase.
end of each episode, the agent updates its knowledge of state–space The random walk explained in Algorithm 5 continues for 𝑀𝑐𝑢𝑡
segments in terms of their value which will be utilized in the following many episodes. Similar to the Q-Learning algorithm, each episode

473
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

Algorithm 5 Random Walk phase


1: procedure randomWalk
2: require: ⟨S, A, T , R ⟩, L, 𝑀𝑐𝑢𝑡 , 𝑐𝑞 , 𝑠𝑒𝑔𝑚𝑒𝑛𝑡𝑠 
3: Graph 𝐺 = ⟨𝑁, 𝐴⟩, 𝑁 ← ∅, 𝐴 ← ∅
4: for 𝑒𝑝𝑖𝑠𝑜𝑑𝑒 ← 0 𝐭𝐨 𝑀𝑐𝑢𝑡 do
5: Initialize 𝑠 ∈ S
6: while 𝑠 is not terminal or L is not reached do
7: Randomly choose an action 𝑎 ∈ A
8: Take action 𝑎, observe 𝑟, 𝑠′
9: Add the transition ⟨𝑠, 𝑎, 𝑠′ ⟩ to the graph 𝐺
10: Increment the frequency of observing transition 𝑠 → 𝑠′
11: Increment the frequency of visiting state 𝑠
12: Adjust the arc capacity of (𝑠, 𝑠′ ) w/ updated frequencies
13: 𝑠 ← 𝑠′
14: end while
15: end for
16:  ← ESegQ-Cut(𝐺, 𝑐𝑞 ,  )
17: return 
18: end procedure

two segments and proceed with each created segment. The quality of
a cut is expressed as the ratio cut bi-partitioning metric as defined in
(12). Since good quality cuts should have a small number of arcs while
separating significant balanced areas, only the cuts having a quality
level greater than the pre-determined threshold 𝑐𝑞 are acceptable. Thus,
Fig. 2. The flowchart of the learning process with the proposed method.
the Segmented Q-Cut method runs until no new cut that satisfies the
quality condition is identified. At the end of this method, subgoals are
detected as border states at the cuts.
of the random walk ends until the terminal (goal) state is found or
The extension of the Segmented Q-Cut method starts after obtaining
the step limit (L) is reached. At each time step in an episode, the
the subgoals. To extract the state–space segments, we first search for
agent chooses a random action 𝑎 ∈ A at a state 𝑠 and observes its
adjacent subgoals in the identified 𝑠𝑢𝑏𝑔𝑜𝑎𝑙𝑠 list (X ). We group the
consequences as a reward signal 𝑟 and next state 𝑠′ . The tuple ⟨𝑠, 𝑎, 𝑠′ ⟩
adjacent subgoals and then for each group, we choose the subgoal
experienced by the agent is added to the transition graph 𝐺. At each
time the transition ⟨𝑠, 𝑎, 𝑠′ ⟩ is added to the graph 𝐺, the frequency of having a minimum degree and append it to the 𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 𝑠𝑢𝑏𝑔𝑜𝑎𝑙𝑠 list
observing the transition from 𝑠 to 𝑠′ is incremented by one. Moreover, (XS ). We select the subgoal with the minimum degree within the group
the frequency of visiting state 𝑠 is also incremented by one. Then, since because it adheres to the definition of a significant i.e. high-quality cut.
the arc capacity is defined in terms of relative frequency as defined in As explained in Menache et al.’s study [1], a significant source–sink cut
(11), the capacity of the arc (𝑠, 𝑠′ ) that reflects the transition 𝑠 → 𝑠′ (𝑠 − 𝑡 cut) is a cut with a small number of arcs while creating enough
is adjusted accordingly using these updated frequencies. After 𝑀𝑐𝑢𝑡 states both in the source segment 𝑁𝑠 and the sink segment 𝑁𝑡 . To this
episodes are completed in the random walk, the transition graph is end, we aim to obtain only significant subgoals with this extension.
accumulated with the agent’s experiences. The graph is then given to To move from subgoals to state–space segments, nodes denoting
the ESegQ-Cut() method as an input that first computes the subgoals the selected significant subgoals are removed from the transition graph
and then extracts the segments of the state-space. Finally, the extracted 𝐺. With the Union Find algorithm, weakly connected components of
state–space segmentation information is returned to the Q-Segmenter the resulting graph are detected. Each weakly connected component
agent for use in the learning process. is treated as a new segment and added to  . Finally, each identified
subgoal in XS is also added to  as a separate segment. The procedure
Extended segmented Q-cut. The main idea of the ESegQ-Cut method is terminates after the extracted  are returned.
to identify the subgoals with the Segmented Q-Cut algorithm, and then
extract the state–space segments using the identified subgoals as can be
seen in Algorithm 6. 4.2.2. Reward shaping based on state–space segmentation
The procedure starts with appending all the nodes in transition Within the completion of the random walk procedure, the Q-
graph 𝐺 to the list of 𝑠𝑒𝑔𝑚𝑒𝑛𝑡𝑠 ( ) and initializing the required lists Segmenter agent gains knowledge of the environment through extracted
to store subgoal-related information. In addition, the 𝑠𝑜𝑢𝑟𝑐𝑒 and 𝑠𝑖𝑛𝑘 state–space segments. The agent should then benefit from this knowl-
nodes are also determined at this point. Since the agent may start at a edge in the learning phase. However, the question is: How should the
different state in each episode of the random walk, the 𝑠𝑜𝑢𝑟𝑐𝑒 node is agent utilize state–space segment information in its learning process
selected as a dummy node added to the graph. Similarly, each episode to speed up the learning? The proposed method suggests applying
of the random walk may terminate at the goal state or an arbitrary state potential-based reward shaping using segment information in terms of
depending on the step limit L. Therefore, the 𝑠𝑖𝑛𝑘 node is selected as values. The basic idea is to compute the values of the segments and
the goal state if it exists in the graph 𝐺, otherwise, a dummy goal state reflect the potential of the states in terms of segment values. Depending
is chosen. on the potential, the environmental reward signals are shaped and
The procedure continues with the application of the Segmented utilized in the Q-value estimations. The illustration that summarizes the
Q-Cut algorithm with the Cut() method explained in Algorithm 7. learning phase is given in Fig. 4. In the following sections, a potential-
While we can identify good-quality cut point(s) on the graph which based reward-shaping strategy depending on the value of segments is
is denoted by the boolean variable 𝑑𝑖𝑣𝑖𝑑𝑒, we divide the graph into presented.

474
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

Fig. 3. A schematic representation for random walk phase.

Fig. 4. A schematic representation for the learning phase.

Value of the segments. The value of a segment 𝑖 denoted by 𝑣𝑠𝑒𝑔𝑖 , 𝑖 ∈  , If any segment is found by the completion of the random walk
is defined as the expected return starting from segment 𝑖 and computed procedure, the agent computes the segment values with getSeg-
as the average Q-value of all possible states in segment 𝑖 for all possible mentValues procedure as given in Algorithm 8 using the rule (13).
actions with Since Q-values for each state–action pair are initialized to zero at the
∑ ∑𝑎∈A 𝑄(𝑠,𝑎) beginning of the first episode, the initial values of the segments also
𝑠∈𝑖 ∣A∣ become zero. However, as the Q-values are updated during the learning
𝑣𝑠𝑒𝑔𝑖 = , (13) process, the segment values should be changed accordingly. By doing
∣𝑖∣
[ ] this, the agent will be able to determine which segment is more useful
where 𝑄(𝑠, 𝑎) ≐ E 𝐺𝑡 ∣ 𝑠𝑡 = 𝑠, 𝑎𝑡 = 𝑎 shows the expected return starting to reach the goal state. Therefore, Q-Segmenter updates the segment
from state 𝑠 and taking action 𝑎, ∣ A ∣ represents the size of the action- values at the end of each episode by the updateSegmentValues
space and ∣ 𝑖 ∣ denotes the size of the segment 𝑖 i.e. the number of states method. As explained in more detail in Algorithm 9, this method
that belong to segment 𝑖. updates the segment values with the update rule
The average state–action value is shown to be an effective Q-value 𝑣𝑛+1 𝑛 𝑛+1 𝑛
(14)
𝑠𝑒𝑔 ← 𝑣𝑠𝑒𝑔 + 𝛼(𝛾𝑣𝑠𝑒𝑔 − 𝑣𝑠𝑒𝑔 ),
𝑖 𝑖 𝑖 𝑖
update rule, specifically in swarm reinforcement learning [34], as it
captures the diversity of Q-values among agents’ individual learning. where 𝛼 ∈ (0, 1] is the learning rate, 𝛾 ∈ (0, 1] is the discount factor and
Motivated by the insight from [34], we capture the diversity of Q-values 𝑛 ∈ [0, 𝑀] denotes the episode number.
belonging to a segment via segment value definition (13). However, Potential-based reward shaping using values of the segments. After the
in Section 5.4.1, we present the analysis where we explore a different initial computation of segment values, Q-Segmenter starts interacting
segment value definition based on optimistic initialization [26]. with the environment by applying the Q-Learning algorithm.

475
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

Algorithm 6 Extended Segmented Q-Cut Algorithm Algorithm 8 Computation of the Segment Values
1: procedure ESegQCut 1: procedure getSegmentValues
2: require: Graph 𝐺 = ⟨𝑁, 𝐴⟩, 𝑐𝑞 , 𝑠𝑒𝑔𝑚𝑒𝑛𝑡𝑠  2: require: 𝑠𝑒𝑔𝑚𝑒𝑛𝑡𝑠  , 𝑠𝑒𝑔𝑚𝑒𝑛𝑡_𝑣𝑎𝑙𝑢𝑒𝑠 V , 𝑄𝑆𝐸𝐺
3:  ← all nodes in the graph 𝐺 3: for each segment 𝑖 in  do
4: 𝑠𝑜𝑢𝑟𝑐𝑒 ← dummy node 4: 𝑠𝑢𝑚 ← 0
5: 𝑠𝑖𝑛𝑘 ← node denoting the goal state if exists in 𝐺, 5: for each state 𝑠 in

𝑖 do
otherwise dummy goal 𝑎∈A 𝑄𝑆𝐸𝐺 (𝑠,𝑎)
6: 𝑠𝑢𝑚 ← 𝑠𝑢𝑚 + ∣A∣
6: 𝑠𝑢𝑏𝑔𝑜𝑎𝑙𝑠 X ←[ ], 𝑠𝑢𝑏𝑔𝑜𝑎𝑙 𝑔𝑟𝑜𝑢𝑝𝑠 XG ←[ ], 7: end for
7: 𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 𝑠𝑢𝑏𝑔𝑜𝑎𝑙𝑠 XS ←[ ] 8:
𝑠𝑢𝑚
V𝑖 ← ∣ ∣
8: 𝑑𝑖𝑣𝑖𝑑𝑒 ← 𝑇 𝑟𝑢𝑒 𝑖
9: end for
9: while 𝑑𝑖𝑣𝑖𝑑𝑒 do
10: return V
10: for each segment in the  do
11: end procedure
11: 𝑑𝑖𝑣𝑖𝑑𝑒 ← 𝑑𝑖𝑣𝑖𝑑𝑒 and Cut(𝐺, 𝑐𝑞 ,  , 𝑠𝑜𝑢𝑟𝑐𝑒, 𝑠𝑖𝑛𝑘, X )
12: end for
13: end while Algorithm 9 Update of the Segment Values
14: X ← getSubgoals()
1: procedure updateSegmentValues
15: XG ← Identify the adjacent subgoals in X , if any
2: require: 𝑠𝑒𝑔𝑚𝑒𝑛𝑡𝑠  , 𝑠𝑒𝑔𝑚𝑒𝑛𝑡 𝑣𝑎𝑙𝑢𝑒𝑠 V , 𝛼, 𝛾
16: for each 𝑔𝑟𝑜𝑢𝑝 in XG do
3: 𝑛𝑒𝑤 𝑠𝑒𝑔𝑚𝑒𝑛𝑡 𝑣𝑎𝑙𝑢𝑒𝑠 V ′ ← getSegmentValues()
17: Append subgoal having min. degree in the 𝑔𝑟𝑜𝑢𝑝 to XS
4: for each segment 𝑖 in  do
18: end for
5: 𝛥𝑠𝑒𝑔 ← 𝛾(V ′ ) − V𝑖
19: for each node 𝑛 in XS do 𝑖
20: 𝐴 ← 𝐴 ⧵ {(𝑛, 𝑖), (𝑖, 𝑛)}, ∀𝑖 ∈ 𝑁 6: V𝑖 ← V𝑖 + 𝛼 ⋅ 𝛥𝑠𝑒𝑔
21: 𝑁 ← 𝑁 ⧵ {𝑛} 7: end for
22: end for 8: return V
23: 𝑤𝑐𝑐 ← Union Find(𝐺) ⊳ weak. conn. comp. of resulting 𝐺 9: end procedure
24:  ← 𝑠𝑒𝑔𝑚𝑒𝑛𝑡𝑠 ∪ {𝑐}, ∀𝑐 ∈ 𝑤𝑐𝑐
25:  ← 𝑠𝑒𝑔𝑚𝑒𝑛𝑡𝑠 ∪ {𝑥}, ∀𝑥 ∈ XS
26: return 
where  is the set of all segments, 1𝑠∈𝑖 is the indicator function that
27: end procedure
takes the value of 1 if state 𝑠 is in segment 𝑖, and 0 otherwise, while
𝑣𝑠𝑒𝑔𝑖 represents the value of the segment 𝑖. A state can be an element
of only one segment. From this, we can conclude that the potential of
Algorithm 7 Cut Procedure of Extended Segmented Q-Cut
a state shows the value of the segment to which the state belongs.
1: procedure Cut We define the potential-based reward shaping function 𝐹 , 𝐹 ∶ S ×
2: require: Graph 𝐺 = ⟨𝑁, 𝐴⟩, 𝑐𝑞 , 𝑠𝑒𝑔𝑚𝑒𝑛𝑡𝑠  S → R in terms of the difference between the potentials of states 𝑠 and
𝑠𝑜𝑢𝑟𝑐𝑒, 𝑠𝑖𝑛𝑘, 𝑠𝑢𝑏𝑔𝑜𝑎𝑙𝑠 X 𝑠′ for transition 𝑠 → 𝑠′ as
3: Subgraph 𝐺𝑠𝑢𝑏 = ⟨𝑁𝑠𝑢𝑏 , 𝐴𝑠𝑢𝑏 ⟩, 𝑁𝑠𝑢𝑏 ← ∅, 𝐴𝑠𝑢𝑏 ← ∅
𝐹 (𝑠, 𝑠′ ) = 𝛾𝛷(𝑠′ ) − 𝛷(𝑠)
4: 𝐺𝑠𝑢𝑏 ← Induced subgraph of the graph 𝐺 containing ∑ ∑
the nodes in  and the arcs between those nodes =𝛾 1𝑠′ ∈𝑖 𝑣𝑠𝑒𝑔𝑖 − 1𝑠∈𝑖 𝑣𝑠𝑒𝑔𝑖
(16)
𝑖∈ 𝑖∈
5: 𝑚𝑖𝑛_𝑐𝑢𝑡_𝑣𝑎𝑙𝑢𝑒, 𝑝𝑎𝑟𝑡𝑖𝑡𝑖𝑜𝑛𝑠 ← minCut(𝐺𝑠𝑢𝑏 , 𝑠𝑜𝑢𝑟𝑐𝑒, 𝑠𝑖𝑛𝑘)
6: Obtain 𝑁𝑠 and 𝑁𝑡 from 𝑝𝑎𝑟𝑡𝑖𝑡𝑖𝑜𝑛𝑠 = 𝛾𝑣𝑠𝑒𝑔𝑘 − 𝑣𝑠𝑒𝑔𝑗 ,
7: 𝑛𝑠 ←∣ 𝑁𝑠 ∣ ⊳ size of source segment where 𝑠′ ∈ segment 𝑘, 𝑠 ∈ segment 𝑗 and 𝑘, 𝑗 ∈  .
8: 𝑛𝑡 ←∣ 𝑁𝑡 ∣ ⊳ size of sink segment
9: 𝐴𝑠,𝑡 ← number of nodes connecting 𝑁𝑠 & 𝑁𝑡 Theorem 1. Let MDP 𝑀 = ⟨S, A, T , R , 𝛾⟩ and transformed MDP 𝑀 ′ be
𝑛 ⋅𝑛
10: 𝑞𝑢𝑎𝑙𝑖𝑡𝑦 ← 𝐴𝑠 𝑡 defined as 𝑀 ′ = ⟨S, A, T , R ′ , 𝛾⟩ where R ′ = R + 𝐹 and the shaping reward
𝑠,𝑡
11: if 𝑞𝑢𝑎𝑙𝑖𝑡𝑦 > 𝑐𝑞 then function 𝐹 ∶ S × S → R is defined as in (16). The potential-based reward
12: Identify new 𝑠𝑜𝑢𝑟𝑐𝑒𝑠 and 𝑠𝑖𝑛𝑘𝑠 for 𝑁𝑠 shaping function 𝐹 defined in (16) preserves the policy invariance i.e. every
13: Identify new 𝑠𝑜𝑢𝑟𝑐𝑒𝑡 and 𝑠𝑖𝑛𝑘𝑡 for 𝑁𝑡 optimal policy in 𝑀 ′ is also an optimal policy in 𝑀 (and vice versa).
14: X ← X ∪ {𝑠𝑖𝑛𝑘𝑠 , 𝑠𝑜𝑢𝑟𝑐𝑒𝑡 }
15:  ← {𝑁𝑠 , 𝑁𝑡 } Proof. The optimal Q-function for the original MDP 𝑀, denoted as 𝑄∗𝑀
16: return True ⊳ another cut is possible satisfies the Bellman optimality equation:
17: end if [ ]
( )
18: return False 𝑄∗𝑀 (𝑠, 𝑎) = E𝑠′ R (𝑠, 𝑎) + 𝛾 max 𝑄∗𝑀 𝑠′ , 𝑎′ . (17)
𝑎′ ∈𝐴
19: end procedure
When we subtract 𝛷(𝑠) from both sides
[ ( )
𝑄∗𝑀 (𝑠, 𝑎) − 𝛷(𝑠)
= E𝑠′ R (𝑠, 𝑎) + 𝛾𝛷 𝑠′ − 𝛷(𝑠)
]
( ( ) ( )) (18)
However, it shapes the reward signals received from the environment +𝛾 max 𝑄∗𝑀 𝑠′ , 𝑎′ − 𝛷 𝑠′ .
𝑎′ ∈𝐴
with potential-based reward shaping by the means of the state–space

segmentation and uses shaped rewards to update its Q-value estimates Since 𝛷(𝑠) = 𝑖∈ 1𝑠∈𝑖 𝑣𝑠𝑒𝑔𝑖 , we get

for state–action pairs. 𝑄∗𝑀 (𝑠, 𝑎) − 1𝑠∈𝑖 𝑣𝑠𝑒𝑔𝑖 = E𝑠′ [R (𝑠, 𝑎)
𝑖∈
The potential of a state 𝑠, denoted with the function 𝛷(𝑠), 𝛷 ∶ S → R ∑ ∑
+𝛾 1𝑠′ ∈𝑖 𝑣𝑠𝑒𝑔𝑖 − 1𝑠∈𝑖 𝑣𝑠𝑒𝑔𝑖
is defined by (19)
𝑖∈ 𝑖∈
( )]
∑ ( ) ∑
𝛷(𝑠) = 1𝑠∈𝑖 𝑣𝑠𝑒𝑔𝑖 , (15) +𝛾 max 𝑄∗𝑀 𝑠′ , 𝑎′ − 1𝑠′ ∈𝑖 𝑣𝑠𝑒𝑔𝑖 .
′ 𝑎 ∈𝐴
𝑖∈ 𝑖∈

476
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

Fig. 6. Locked Shortcut Six-Rooms domain.

Fig. 5. GridWorld experiment domains. Table 1


The size of the domains.
Domain Grid Size |SS | |A
A|
∑ Six-Rooms 32 × 21 606 4
If we define 𝑄𝑀 ′ (𝑠, 𝑎) = 𝑄∗𝑀 (𝑠, 𝑎) − 𝑖∈ 1𝑠∈𝑖 𝑣𝑠𝑒𝑔𝑖 and substitute
Zigzag Four-Rooms 43 × 10 403 4
∑ ∑ Locked Shortcut Six-Rooms 32 × 21 606 × 2 4
𝐹 (𝑠, 𝑠′ ) = 𝛾 1𝑠′ ∈𝑖 𝑣𝑠𝑒𝑔𝑖 − 1𝑠∈𝑖 𝑣𝑠𝑒𝑔𝑖
𝑖∈ 𝑖∈

back to (19), we obtain


[ ] 5. Experiments
( ) ( )
𝑄𝑀 ′ (𝑠, 𝑎) = E𝑠′ R (𝑠, 𝑎) + 𝐹 𝑠, 𝑠′ + 𝛾 max 𝑄𝑀 ′ 𝑠′ , 𝑎′
𝑎′ ∈𝐴
[ ] (20)
( ) This section presents computational experiments to evaluate the
= E𝑠′ R ′ (𝑠, 𝑎) + 𝛾 max 𝑄𝑀 ′ 𝑠′ , 𝑎′ .
𝑎′ ∈𝐴 learning performance of the proposed method (will be referred to as
the Q-Segmenter agent) using the parameter set given in Table 1.
The equation obtained in (20) is the Bellman optimality equation
applied to transformed MDP 𝑀 ′ . Thus, 𝑄𝑀 ′ must be an optimal state– In Section 5.1, the problem domains used in the experiments are
action value function since it satisfies the Bellman optimality equation. explained. The parameter settings in the experiments are provided in
Therefore, Section 5.2. Finally, experiment results and evaluation of the method
∑ are given in Section 5.3.
𝑄∗𝑀 ′ (𝑠, 𝑎) = 𝑄𝑀 ′ (𝑠, 𝑎) = 𝑄∗𝑀 (𝑠, 𝑎) − 1𝑠∈𝑖 𝑣𝑠𝑒𝑔𝑖 . (21)
𝑖∈
5.1. Problem domains
Following (21), the optimal policy for transformed MDP 𝑀 ′ satisfies
the equation:
∗ ∗ In the experiments to evaluate the learning speed of the Q-Segmenter
𝜋𝑀 ′ (𝑠) ∈ arg max 𝑄𝑀 ′ (𝑠, 𝑎)
𝑎∈𝐴
∑ agent, we used sparse reward problem benchmarks as diagnostic tasks.
= arg max 𝑄∗𝑀 (𝑠, 𝑎) − 1𝑠∈𝑖 𝑣𝑠𝑒𝑔𝑖 (22) Hence, sparse reward variations of GridWorld navigation domains,
𝑎∈𝐴
𝑖∈ which are Six-Rooms [1], Zigzag Four-Rooms [35] and Locked
= arg max 𝑄∗𝑀 (𝑠, 𝑎). □ Shortcut Six-Rooms [36] with domain sizes given in Table 1,
𝑎∈𝐴
were highly suitable settings for our experiments due to their delayed-
Since the effect of actions is eliminated in the segment values by
feedback nature.
averaging the Q-values over all possible actions as shown in (13), the
∑ As sketched in Figs. 5 and 6, the RL agent navigates in a GridWorld
optimal policy for 𝑀 ′ is not affected by the additional − 𝑖∈ 1𝑠∈𝑖 𝑣𝑠𝑒𝑔𝑖
′ domain that has several rooms connected by hallways starting from an
term. Hence, the optimal policy for 𝑀 becomes also optimal for 𝑀 and
policy invariance is preserved with the policy-based reward shaping arbitrary state in the upper-left room. The goal of the agent in all of
function 𝐹 defined in (16). Similarly, we can prove that the optimal the domains is to reach the goal state labeled with the letter 𝐺. The
policy for 𝑀 will also be optimal in 𝑀 ′ by considering the reward state is defined as the ⟨𝑥, 𝑦⟩ coordinates of the agent within the size of
shaping function −𝐹 . the state–space given in Table 1, and the possible action set in a state
The shaped reward signal for the transition of the Q-Segmenter agent is determined as the set of four actions: north, south, east, and west. The
time step 𝑡 becomes sparse reward function gives a reward of +10 to the agent if it reaches
the goal state and 0 to any other movements in the domain.
𝑟̃𝑡 = 𝑟𝑡 + 𝐹 (𝑠𝑡 , 𝑠𝑡+1 ). (23) Furthermore, we also experimented on a more complex version of
Depending on the transition of the agent, the shaped reward can the Six-Rooms domain called Locked Shortcut Six-Rooms. In
take positive or negative values. If the agent transitions to a state with this problem, the southwest door is locked until the state with the key
a greater segment value than the current one, then 𝐹 will behave is visited as illustrated in Fig. 6. The state is defined as a tuple ⟨𝑥, 𝑦, 0∕1⟩
as a bonus for the agent. On the other hand, the agent will receive that shows the ⟨𝑥, 𝑦⟩ coordinates of the agent’s location and whether the
a punishment for visiting a state with a lower segment value. Even agent has the key i.e. visited the state with the key or not. Similar to
though the agent traverses in the same segment, 𝐹 will take a negative the Six-Rooms, the action space consists of north, south, east, and west.
value due to the discount factor. This is especially helpful in sparse In each episode of learning, the agent always starts from the southwest
reward environments since the agent receives frequent feedback while corner of the grid and tries to reach the goal state denoted with 𝐺. This
being in the same segment and it may encourage the agent to complete is also a sparse reward environment as the agent receives a reward of
the task as fast as possible. +10 for reaching the goal state or 0 for all other cases in an episode.

477
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

Fig. 7. Learning performances of proposed method Q-Segmenter, Q-Learning and Macro Q-Learning for Six-Rooms GridWorld domain with a limit of 1000 steps per episode.

Table 2 original Q-Learning algorithm as the benchmarks [1]. Furthermore, we


Parameter settings for the experiments. compared the learning performances in terms of measures such as the
Parameter Value average number of steps taken to reach the goal state, average reward
Learning rate 𝛼 0.3 per step, and average elapsed time required for learning. The average
Discount rate 𝛾 0.9 reward per step is computed by dividing the total reward in an episode
Exploration probability 𝜀 linearly decaying from 0.1 to 0.05
by the episode length. As we run each experiment 50 times, the average
Number of episodes 𝑀 1000
Step limit L domain-dependent performance of all experiments is reported in Table 3.
Number of episodes for random walk 𝑀𝑐𝑢𝑡 25 For the realization of the Macro Q-Learning method, we applied a
Cut quality threshold 𝑐𝑞 1000 random walk in early episodes of the training exactly as in the case of
the proposed Q-Segmenter method. Then the Seg-Q cut method is exe-
cuted to extract subgoals. Based on these subgoals, the corresponding
options are generated and trained with the Experience Replay approach
5.2. Experiment settings to terminate in subgoal states over the past episodes of the learning
agent as in [1]. Table B.5 in Appendix B provides the settings regarding
We fixed the parameter values as given in Table 2 for the exper- intra-option training phase of the Macro Q-Learning method.
imental study and used two different step limit (L) for an episode in As can be noted from Table 3, Q-Segmenter outperforms Q-Learning
the problem domains. We set this parameter to 1000 and 2000 for the and Macro-Q Learning in all domains regarding the number of steps to
Six-Rooms, 2000 and 5000 for the Zigzag Four-Rooms, and 3000 reach the goal state and average reward measures. It also converges to
and 5000 for the Locked Shortcut Six-Rooms environments. By better results in all domains. This indicates that although Q-Segmenter
doing so, we aimed to observe the impact of having a limited amount is able to identify at least two segments in each domain, the PBRS
of interaction time on the agent’s learning and whether our proposed mechanism contributes significantly to improving the agent’s learning.
method speeds up learning in such a case or not. In terms of elapsed time, there is no clear dominant method since each
We performed the experiments using workstation that has Intel® method can require shorter time in different domains compare to oth-
Core™ i7 3.10 GHz processors and 16 GB RAM with 64-bit Microsoft ers. However, when the time for the learning phase is considered, we
Windows 10 operating system. We replicated each experiment 50 times can conclude that there is no substantial difference between learning
and presented the average learning performances in the following times.
section. All the source code of our method and its experimental settings When we compare the learning speeds from Figs. 7 to 12, we notice
for the sample domains are publicly available online.1 that the learning is much faster than Q-Learning. This shows us that Q-
Segmenter learns before 1000 episodes are completed. If we also account
5.3. Experiment results and discussion for the observation that Q-Segmenter is able to learn much faster than
Q-Learning, then the training time for Q-Segmenter does not need to be
that long. Thus, the actual elapsed time required for learning will be
In order to compare our method with the existing approaches in
shorter. As a result, our proposed method indeed accelerates learning in
the literature, we chose 3 widely used Deep RL methods, namely
problem domains with sparse explicit reward structures. Unfortunately,
Advantage Actor Critic (A2C) [37], Deep Q-Network (DQN) [38] and
the Macro Q-Learning algorithm either converges to a non-optimal
Proximal Policy Optimization (PPO) [39]. We used the stable-baseline3
policy or fails to learn the task in sample domains. To the best of
codebase [40] to execute these models on selected problems. Although our knowledge, this can be related to the step limit (L) applied in the
we applied an extensive search on the parameters for fine-tuning these termination of episodes.
models (see Tables B.6 and B.8 in Appendix B), we could not obtain Finally, to illustrate the identified state–space segments, we provide
any convergence on a stable solution for the sample problems. In other three examples that show segments and the set of cuts on the transition
words, all of the parameter settings result in the termination of the graph after the random walk phase, i.e. the first 25 episodes are
episodes with the step limit (L) we applied. This is not a surprising completed for each domain. As can be seen from Figs. 13 to 14, the Seg-
outcome considering the motivation behind these models. Deep models menter component is able to identify each room as a separate segment.
mostly perform well with high-dimensional and/or visual data, yet However, in some cases, it aggregates several rooms and gathers them
when the sensory data of the agent is assumed to be limited, the perfor- in a single segment as depicted in Fig. 13. These are only the examples
mances of these models severely get worse [41–43]. Implementations from a single run, hence we observed other combinations of segments
for the experimentation of these models are publicly online.2 (at least two segments) in the rest of the experiments. For instance, it is
Hence, we determined the Macro Q-Learning algorithm which uses observable from Fig. 14 that Q-Segmenter could not identify each room
the SegQ-Cut method to extract subgoals to generate options, and the as a separate segment. Although finding accurate segments provides
much better enhancement on learning performance, this is not crucial
for our approach, as long as the learning agent is guided to the proper
1
[https://github.com/melisilaydabal/pbrs_seg] segment via a valid potential value as experimentally proven in our
2
[https://github.com/melisilaydabal/sb3_comparison] study.

478
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

Table 3
Overall performance comparison of the proposed method Q-Segmenter with benchmarks.
Problem Method Average Steps (stdev) Average Reward (stdev) Average Average Elapsed Time (sec) for
Elapsed Time
sec (stdev)

over of the over of the Random Walk Learning


all episodes last episode all episodes last episode Phase (stdev) Phase (stdev)

Q-Segmenter 696.33 (166.83) 444.04 (458.39) 0.06 (0.04) 0.13 (0.11) 121.26 (45.17) 20.58 (9.09) 100.67 (45.83)
Six-rooms
Q-Learning 984.99 (17.94) 929.40 (228.14) 0.00 (0.00) 0.01 (0.05) 93.71 (7.15) – –
(1000 steps)
Macro Q-Learning 998.89 (0.95) 999.00 (0.00) 0.00 (0.01) 0.00 (0.00) 54.70 (7.99) 1.22 (0.02) 53.47 (7.98)

Q-Segmenter 198.59 (391.22) 43.70 (5.19) 0.18 (0.08) 0.23 (0.03) 67.45 (6.70) 33.47 (1.95) 33.98 (6.09)
Six-rooms
Q-Learning 741.70 (800.56) 46.94 (5.20) 0.13 (0.09) 0.22 (0.02) 70.48 (14.89) – –
(2000 steps)
Macro Q-Learning 1369.72 (150.26) 1283.40 (932.08) 0.01 (0.04) 0.00 (0.00) 63.84 (33.16) 2.46 (0.07) 61.39 (33.12)

Q-Segmenter 1808.44 (89.51) 1489.18 (806.83) 0.01 (0.00) 0.03 (0.05) 196.59 (42.78) 6.94 (3.96) 189.64 (46.54)
Zigzag-four-rooms
Q-Learning 1924.65 (104.13) 1562.38 (755.86) 0.00 (0.00) 0.02 (0.05) 186.77 (12.23) – –
(2000 steps)
Macro Q-Learning 1998.64 (2.46) 1999.00 (0.00) 0.00 (0.01) 0.00 (0.00) 133.5 (12.31) 3.77 (0.08) 129.73 (12.31)

Q-Segmenter 186.45 (473.57) 74.00 (6.39) 0.11 (0.03) 0.14 (0.01) 60.93 (4.44) 30.34 (0.84) 30.58 (4.57)
Zigzag-four-rooms
Q-Learning 1169.62 (1742.68) 75.74 (5.40) 0.09 (0.06) 0.13 (0.01) 116.71 (18.78) – –
(5000 steps)
Macro Q-Learning 4491.01 (56.40) 4502.60 (1478.06) 0.01 (0.07) 0.00 (0.00) 214.33 (67.97) 5.61 (0.19) 208.72 (67.94)

Q-Segmenter 384.69(669.94) 54.00(2.26) 0.14(0.06) 0.19(0.01) 130.08 (27.19) 53.66 (4.29) 76.42 (28.05)
Locked Shortcut Six-rooms
Q-Learning 1257.07(1301.60) 56.00(3.55) 0.10(0.08) 0.18(0.01) 116.20 (19.80) – –
(3000 steps)
Macro Q-Learning 2842.97 (39.81) 2881.30 (576.61) 0.00 (0.02) 0.00 (0.00) 221.29 (46.50) 6.19 (0.19) 215.11 (46.48)

Q-Segmenter 308.39(672.78) 56.30(3.27) 0.14(0.06) 0.18(0.01) 180.84 (15.43) 66.19 (2.38) 114.64 (15.04)
Locked Shortcut Six-rooms
Q-Learning 840.56 (1566.00) 56.70(4.30) 0.01(0.01) 0.02(0.00) 84.58 (12.40) – –
(5000 steps)
Macro Q-Learning 4028.74 (257.89) 3813.42 (2109.76) 0.01 (0.11) 0.00 (0.00) 210.85 (93.40) 6.28 (0.26) 204.58 (93.33)

Fig. 8. Learning performances of proposed method Q-Segmenter, Q-Learning and Macro Q-Learning for Six-Rooms GridWorld domain with a limit of 2000 steps per episode.

Fig. 9. Learning performances of proposed method Q-Segmenter, Q-Learning and Macro Q-Learning for Zigzag Four-Rooms GridWorld domain with a limit of 2000 steps per
episode.

Fig. 10. Learning performances of proposed method Q-Segmenter, Q-Learning and Macro Q-Learning for Zigzag Four-Rooms GridWorld domain with a limit of 5000 steps
per episode.

479
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

Fig. 11. Learning performances of proposed method Q-Segmenter, Q-Learning and Macro Q-Learning for Locked Shortcut Six-Rooms GridWorld domain with a limit of
3000 steps per episode.

Fig. 12. Learning performances of proposed method Q-Segmenter, Q-Learning and Macro Q-Learning for Locked Shortcut Six-Rooms GridWorld domain with a limit of
5000 steps per episode.

Fig. 13. Segments and cuts on the graph after random walk phase for 25 episodes is completed in Six-Rooms GridWorld domain. Each node on the transition graph represents
a state, whereas arcs are the possible actions taken to reach the pointed states. The white nodes and the white rectangle nodes symbolize source and sink states, respectively.
The nodes with the same color belong to the same state segment. The highlighted green nodes show the identified subgoal states and the accompanied red dashed arcs are the
minimum cuts. As illustrated, the agent identified 5 state–space segments in Six-Rooms GridWorld domain.

5.4. Analysis average Q-value of all states belonging to the segment, also provided
in (13). However, based on the optimistic initialization idea [26], we
In this section, we further analyze the specific design choices of can define the value of a segment 𝑖 optimistically using the rule
our approach, specifically, considering an alternative segment value ∑
definition in Section 5.4.1, and different values for critical hyperpa- 𝑎∈A 𝑄(𝑠, 𝑎)
𝑣𝑠𝑒𝑔𝑖 = max . (24)
rameters such as cut quality threshold, and the number of episodes for 𝑠∈𝑖 ∣A∣
the random walk in Section 5.4.2. [ ]
Here, 𝑄(𝑠, 𝑎) ≐ E 𝐺̃ 𝑡 ∣ 𝑠𝑡 = 𝑠, 𝑎𝑡 = 𝑎 represents the expected return
5.4.1. Alternatives for the segment value starting from state 𝑠 and taking action 𝑎, while ∣ A ∣ denotes the size of
Motivated by the average-Q information exchange among swarm the action space. This rule suggests assigning the maximum state value
agents introduced in [34], we define the value of segments as the as the segment value, thereby encouraging exploration.

480
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

Fig. 14. Segments and cuts on the graph after random walk phase for 25 episodes is completed in Locked Shortcut Six-Rooms domain. Each node on the transition graph
represents a state, whereas arcs are the possible actions taken to reach the pointed states. The white nodes and the white rectangle nodes symbolize source and sink states,
respectively. The nodes with the same color belong to the same state segment. The highlighted green nodes show the identified subgoal states and the accompanied red dashed
arcs are the minimum cuts. Based on this, it is clear that the agent identified 6 state–space segments in Locked Shortcut Six-Rooms domain.

Fig. 15. Learning performances of proposed method Q-Segmenter, Q-Segmenter-MaxQ with segment value definition given in (24) and Q-Learning for Six-Rooms GridWorld
domain with a limit of 1000 steps per episode.

Fig. 16. Learning performances of proposed method Q-Segmenter, Q-Segmenter-MaxQ with segment value definition given in (24) and Q-Learning for Six-Rooms GridWorld
domain with a limit of 2000 steps per episode.

We assessed the performance of the optimistic segment value vari- and 16, Q-Segmenter-MaxQ, leveraging enhanced exploration cou-
ant of the Q-Segmenter, referred to as Q-Segmenter-MaxQ, in com- pled with a reward shaping mechanism, exhibits superior performance
parison to our own approach within the Six-Rooms GridWorld in the initial learning episodes. However, in line with observations
domain, imposing step limits of 1000 and 2000. We employed the same in [26], the efficacy of optimistic initialization of values is inherently
performance metrics detailed in Section 5.3. As depicted in Figs. 15 temporary. Moreover, the agent may struggle to accurately identify

481
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

Fig. 17. Learning performances of proposed method Q-Segmenter with different cut Fig. 18. Learning performances of proposed method Q-Segmenter with different number
quality threshold choices under Six-Rooms GridWorld domain with a limit of 1000 of random walk episodes, 𝑀cut , choices under Six-Rooms GridWorld domain with
steps per episode. a limit of 1000 steps per episode.

critical segments, such as those containing the goal state, due to the of random walk episodes is anticipated to yield a transition graph with
overestimation of the value of other segments. As segment values an increased number of nodes and arcs. However, this amplified size
are gradually updated, the impact of this overestimation diminishes. of the transition graph introduces challenges in accurately identifying
Nevertheless, this correction is insufficient to significantly expedite the high-quality cuts. Consequently, the misidentification of segments may
learning process, as demonstrated in the results. However, conducting lead to misguided learning on the part of the agent. From a technical
a more comprehensive analysis across diverse problem settings and point of view, employing a substantial number of random walk episodes
exploring various segment value definitions would provide a more may pose a significant bottleneck in real-world scenarios, especially
nuanced understanding. since even larger transition graphs are expected to be constructed by
the agent.
5.4.2. Exploration of parameter choices For this, we investigated the impact of 𝑀cut on the learning per-
Cut quality threshold. The quality of the partitioned state space, as formance in the Six-Rooms GridWorld domain while imposing a
articulated by the metric in (12), is heavily influenced by the chosen limit of 1000 steps per episode. As shown in Fig. 18, a notable trend
cut quality threshold. A desirable outcome involves achieving a bal- is seen: a reduction in the number of random walk episodes correlates
anced partition with a minimal number of arcs in the cut. Considering with improved performance. With extended random walk episodes, the
this, we conducted an exploration of various cut quality thresholds to agent becomes more susceptible to identifying suboptimal cuts, as the
assess their impact on the learning performance of the agent within transition graph is likely to expand, and the visitation frequency values
the Six-Rooms GridWorld domain with a limit of 1000 steps per for randomly explored states tend to increase. This, in turn, has a
episode. consequential effect on cut identification. From a technical standpoint,
In Fig. 17, we illustrate the comparison of learning performance a scenario with a limited number of random walk episodes poses less
w.r.t. the number of steps and average reward measures under different complexity for cut identification, as we anticipate a relatively smaller
cut quality thresholds. A poor learning performance is observed when state transition graph in such cases.
employing both low and high values for the cut quality threshold. In
the case of a low threshold, the agent tends to recognize cuts involving 6. Conclusion
states that are not highly useful, such as those that are not associated
with hallways. Conversely, higher threshold values result in the agent In this study we propose a potential-based reward shaping using
identifying a reduced number of subgoals, leading to crowded segments state–space segmentation with the extended segmented Q-Cut algorithm to
within the state space. If an initial cut is inaccurately identified, the overcome the slow learning in RL problems with sparse explicit reward
agent may propagate this error to subsequent segmentations in the structure. In this method, rather than using a reward shaping mecha-
cut procedure, primarily due to the stringent cut quality condition. nism to extract a piece of useful information about the environment,
Therefore, a moderate value for this hyperparameter is set for the we apply potential-based reward shaping depending on the extracted
performance evaluation of our approach. information about the environment in the learning process of the agent.
That means the reward shaping is performed using the state–space
𝑀cut ∶ The number of random-walk-episodes. In contrast to the cut segment information. Our analyses on sparse reward problem settings
quality threshold, the number of episodes for the random walk, denoted showed that the proposed method accelerates the learning of the agent
as 𝑀cut , exerts an implicit influence on the identification of state–space while maintaining policy invariance and without the need to prolong
segments, through the generated state-transition graph. A higher count the computation time.

482
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

Table A.4 Table B.5


Key notations. Parameters for the experiments of Macro-Q Learning in sample domains.
Notation Explanation Parameter Values
𝑠 State number of experience replay episodes 10
𝑎 Action default artificial reward −1
𝑟 Reward artificial reward for exceeding init. set −10
𝜋 Policy artificial reward for reaching subgoal 100
𝐺̃ Long-term return
𝑉 (𝑠) State value
Table B.6
𝑄(𝑠, 𝑎) State–action value
Parameters for the experiments of A2C model in sample domains.
𝛷 Potential of a state
𝐺(𝑁, 𝐴) A transition graph with node set 𝑁 and arc set 𝐴 Parameter Values
𝑣seg Value of a segment learning_rate 0.0007, 0.0005, 0.0001, 0.001
𝑀𝑐𝑢𝑡 Number of episodes for random walk gamma (discount factor) 0.9
𝑐𝑞 Cut quality threshold n_steps (# of steps per update) 32, 64, 128

Table B.7
Parameters for the experiments of DQN model in sample domains.
In future work, one can consider online identification of state–space
segments i.e. using the Segmenter component periodically rather than Parameter Values
identifying segments only once at the end of the random walk phase. learning_rate 0.00001, 0.0005, 0.0001, 0.001
However, in this direction, the question of how to combine previously gamma (discount factor) 0.9
identified segment information with the new ones should be carefully epsilon_start (initial expl. rate) 0.9, 0.5, 0.1
thought out. Moreover, the cut quality threshold can be designed with epsilon_end (final expl. rate) 0.1
a more systematic approach instead of hand-crafted as it significantly
affects the quality of identified segments. Another research direction Table B.8
can be using a population of agents in the random walk phase of Parameters for the experiments of PPO model in sample domains.
the proposed method such that the state–space segment information Parameter Values
is extracted in a faster way from the combined experiences of the learning_rate 0.0005, 0.0003, 0.0001, 0.001
agent population. Concluding, it could be valuable to conduct an gamma (discount factor) 0.9
experimental study on more complex scenarios, such as the domains n_steps (# of steps per update) 32, 64, 128
that are beyond grid world settings and have much larger state and clip_range (clipping param.) 0.1, 0.2, 0.3
action spaces.

CRediT authorship contribution statement References

Melis İlayda Bal: Conceptualization, Software, Writing – original [1] I. Menache, S. Mannor, N. Shimkin, Q-cut - dynamic discovery of sub-goals
draft. Hüseyin Aydın: Writing – review & editing, Software, Vali- in reinforcement learning, in: Proceedings of the 13th European Conference
dation. Cem İyigün: Writing – review & editing, Supervision. Faruk on Machine Learning, ECML ’02, Springer-Verlag, Berlin, Heidelberg, 2002, pp.
295–306.
Polat: Writing – review & editing, Supervision.
[2] R.S. Sutton, D. Precup, S. Singh, Between MDPs and semi-MDPs: A framework for
temporal abstraction in reinforcement learning, Artificial Intelligence 112 (1–2)
Declaration of competing interest (1999) 181–211, http://dx.doi.org/10.1016/S0004-3702(99)00052-1.
[3] T. Okudo, S. Yamada, Subgoal-based reward shaping to improve efficiency in
reinforcement learning, IEEE Access 9 (2021) 97557–97568, http://dx.doi.org/
The authors declare that they have no known competing finan-
10.1109/ACCESS.2021.3090364.
cial interests or personal relationships that could have appeared to [4] A. Demir, E. Çilden, F. Polat, Landmark based reward shaping in reinforcement
influence the work reported in this paper. learning with hidden states, in: Proceedings of the 18th International Conference
on Autonomous Agents and MultiAgent Systems, AAMAS ’19, International
Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2019,
Data availability
pp. 1922–1924.
[5] M.I. Bal, C. Iyigun, F. Polat, H. Aydin, Population-based exploration in reinforce-
Data will be made available on request. ment learning through repulsive reward shaping using eligibility traces, Ann.
Oper. Res. (2024) 1–33.
[6] P. Dayan, T. Sejnowski, Exploration bonuses and dual control, in: Machine
Acknowledgment
Learning, 1996, pp. 5–22.
[7] D. Rengarajan, G. Vaidya, A. Sarvesh, D. Kalathil, S. Shakkottai, Reinforcement
The authors would like to thank Anastasia Akkuzu for proofreading learning with sparse rewards using guidance from offline demonstration, 2022,
of the manuscript. arXiv preprint arXiv:2202.04628.
[8] S. Paul, J. Vanbaar, A. Roy-Chowdhury, Learning from trajectories via subgoal
discovery, Adv. Neural Inf. Process. Syst. 32 (2019).
Appendix A. Notations [9] T. Brys, A. Harutyunyan, H.B. Suay, S. Chernova, M.E. Taylor, A. Nowé,
Reinforcement learning from demonstration through shaping, in: Twenty-Fourth
The notation used in the paper is given in Table A.4. International Joint Conference on Artificial Intelligence, 2015.
[10] M.J. Mataric, Reward functions for accelerated learning, in: In Proceedings of
the Eleventh International Conference on Machine Learning, Morgan Kaufmann,
Appendix B. Experiment details 1994, pp. 181–189.
[11] O. Marom, B. Rosman, Belief reward shaping in reinforcement learning, in:
Further details about the experimentation of our study (including Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, No.
1, 2018, http://dx.doi.org/10.1609/aaai.v32i1.11741, URL https://ojs.aaai.org/
parameters of Macro Q-Learning and Deep-RL models that we tried for index.php/AAAI/article/view/11741.
the comparisons) are given in the Tables B.5 to B.8. [12] J. Schmidhuber, A possibility for implementing curiosity and boredom in model-
building neural controllers, in: Proceedings of the First International Conference

483
M.İ. Bal et al. Future Generation Computer Systems 157 (2024) 469–484

on Simulation of Adaptive Behavior on from Animals To Animats, MIT Press, [38] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M.
Cambridge, MA, USA, 1991, pp. 222–227. Riedmiller, Playing atari with deep reinforcement learning, 2013, arXiv preprint
[13] D. Pathak, P. Agrawal, A.A. Efros, T. Darrell, Curiosity-driven exploration by arXiv:1312.5602.
self-supervised prediction, in: Proceedings of the 34th International Conference [39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy
on Machine Learning - Volume 70, ICML ’17, JMLR.org, 2017, pp. 2778–2787. optimization algorithms, 2017, arXiv preprint arXiv:1707.06347.
[14] A.Y. Ng, D. Harada, S. Russell, Policy invariance under reward transformations: [40] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable-
Theory and application to reward shaping, in: In Proceedings of the Sixteenth Baselines3: Reliable reinforcement learning implementations, J. Mach. Learn.
International Conference on Machine Learning, Morgan Kaufmann, 1999, pp. Res. 22 (268) (2021) 1–8, URL http://jmlr.org/papers/v22/20-1364.html.
278–287. [41] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, D. Meger, Deep
[15] E. Wiewiora, G. Cottrell, C. Elkan, Principled methods for advising reinforcement reinforcement learning that matters, in: Proceedings of the AAAI conference on
learning agents, in: Proceedings of the Twentieth International Conference on artificial intelligence, Vol. 32, 2018.
International Conference on Machine Learning, ICML ’03, AAAI Press, 2003, pp. [42] A. Demir, E. Çilden, F. Polat, Landmark based guidance for reinforcement
792–799. learning agents under partial observability, Int. J. Mach. Learn. Cybern. (2022)
[16] P. Goyal, S. Niekum, R.J. Mooney, Using natural language for reward shaping 1–21.
in reinforcement learning, 2019, arXiv preprint arXiv:1903.02020. [43] A. Demir, Learning what to memorize: Using intrinsic motivation to form useful
[17] Z. Yang, M. Preuss, A. Plaat, Potential-based reward shaping in sokoban, 2021, memory in partially observable reinforcement learning, Appl. Intell. (2023) 1–19.
arXiv arXiv:2109.05022.
[18] M. Grzes, D. Kudenko, Plan-based reward shaping for reinforcement learning,
Knowl. Eng. Rev. 31 (2008) 10–22, http://dx.doi.org/10.1109/IS.2008.4670492.
[19] A. Harutyunyan, S. Devlin, P. Vrancx, A. Nowe, Expressing arbitrary reward Melis Ilayda Bal is a Doctoral Researcher at the Max
functions as potential-based advice, in: Proceedings of the AAAI Conference Planck Institute for Intelligent Systems (MPI-IS), Tübingen.
on Artificial Intelligence, Vol. 29, No. 1, 2015, http://dx.doi.org/10.1609/aaai. She received her B.S. in Industrial Engineering with a
v29i1.9628. minor in Computer Engineering, and M.S. in Operations
[20] S. Devlin, D. Kudenko, Plan-based reward shaping for multi-agent reinforcement Research degrees from the Middle East Technical Uni-
learning, Knowl. Eng. Rev. 31 (1) (2016) 44–58. versity (METU), Ankara in 2019 and 2022, respectively.
[21] B. Huang, Y. Jin, Reward shaping in multiagent reinforcement learning for She is currently a Ph.D. in Computer Science student at
self-organizing systems in assembly tasks, Adv. Eng. Inform. 54 (2022) 101800. CS@maxplanck doctoral program and her research interests
[22] Ö. Şimşek, A.G. Barto, Using relative novelty to identify useful temporal abstrac- include reinforcement learning, multi-agent systems, and
tions in reinforcement learning, in: Proceedings of the Twenty-First International game theory.
Conference on Machine Learning, ICML ’04, ACM, 2004, pp. 95–102.
[23] Ö. Şimşek, A.P. Wolfe, A.G. Barto, Identifying useful subgoals in reinforcement
Hüseyin Aydın is a Teaching and Research Assistant at the
learning by local graph partitioning, in: Proceedings of the 22nd International
Department of Computer Engineering at the Middle East
Conference on Machine Learning, ICML ’05, Association for Computing Ma-
Technical University, Ankara. He is also a visiting researcher
chinery, New York, NY, USA, 2005, pp. 816–823, http://dx.doi.org/10.1145/
at the Department of Information and Computing Science at
1102351.1102454.
the Utrecht University for his Post-Doctoral studies funded
[24] Ö. Şimşek, Behavioral Building Blocks for Autonomous Agents: Description,
by the Scientific and Technological Research Council of
Identification, and Learning (Ph.D. thesis), University of Massachusetts Amherst,
Turkey (TUBITAK). He received B.S., M.S., and Ph.D. de-
2008.
grees from the same department in 2015, 2017, and 2022,
[25] M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic
respectively. His research interests include reinforcement
Programming, first ed., John Wiley & Sons, Inc., USA, 1994.
learning, multi-agent systems, artificial intelligence, and
[26] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, A Bradford
hybrid intelligence.
Book, Cambridge, MA, USA, 2018.
[27] G.A. Rummery, M. Niranjan, On-Line Q-Learning Using Connectionist Systems,
Tech. Rep., 1994.
[28] C.J.C.H. Watkins, Learning from Delayed Rewards (Ph.D. thesis), King’s College, Cem İyigün is a Professor in the Department of Indus-
Cambridge, UK, 1989. trial Engineering at the Middle East Technical University
[29] A. McGovern, R.S. Sutton, A.H. Fagg, Roles of macro-actions in accelerating (METU). Prior to joining METU in 2009, he worked as
reinforcement learning, in: Grace Hopper Celebration of Women in Computing, a lecturer and visiting assistant professor in Management
Vol. 1317, 1997, p. 15. Science and Information Systems department at Rutgers
[30] A. McGovern, R.S. Sutton, Macro-actions in reinforcement learning: An empirical Business School. He received his Ph.D. degree in 2007 from
analysis, Comput. Sci. Depart. Fac. Publ. Ser. (1998) 15. Rutgers Center for Operations Research (RUTCOR) at Rut-
[31] A.D. Laud, Theory and application of reward shaping in reinforcement learning gers University. His research interests lie primarily in data
(Ph.D. thesis), University of Illinois at Urbana-Champaign, 2004. analysis, data mining (specifically clustering problems), time
[32] R.S. Sutton, Integrated architectures for learning, planning, and reacting based series forecasting, and their applications to bioinformatics,
on approximating dynamic programming, in: In Proceedings of the Seventh climatology and electricity load forecasting.
International Conference on Machine Learning, Morgan Kaufmann, 1990, pp.
216–224.
Faruk Polat is Professor of Computer Science at the Depart-
[33] S. Amin, M. Gomrokchi, H. Satija, H. van Hoof, D. Precup, A survey of
ment of Computer Engineering at the Middle East Technical
exploration methods in reinforcement learning, 2021, arXiv preprint arXiv:2109.
University, Ankara. Dr. Polat received a B.S. degree in Com-
00157.
puter Engineering from METU in 1987. He received M.S.
[34] H. Iima, Y. Kuroe, Swarm reinforcement learning algorithms based on sarsa
and Ph.D. degrees from Computer Engineering from Bilkent
method, in: 2008 SICE Annual Conference, IEEE, 2008, pp. 2045–2049.
University in 1989 and 1994, respectively. He was a visiting
[35] A. Demir, E. Çilden, F. Polat, Automatic landmark discovery for learning agents
scholar at the Department of Computer Science at the
under partial observability, Knowl. Eng. Rev. 34 (2019) e11.
University of Minnesota, Minneapolis between 1992–1993.
[36] H. Aydın, E. Çilden, F. Polat, Using chains of bottleneck transitions to decompose
His research focuses primarily on artificial intelligence,
and solve reinforcement learning tasks with hidden states, Future Gener. Comput.
autonomous agents and multi-agent systems.
Syst. 133 (2022) 153–168.
[37] V. Mnih, A.P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver,
K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in:
International Conference on Machine Learning, PMLR, 2016, pp. 1928–1937.

484

You might also like