Abstract

Reinforcement Learning (RL) algorithms encounter slow learning in environments with sparse explicit reward structures due to the limited feedback available on the agent's behavior. This problem is exacerbated particularly in complex tasks with large state and action spaces. To address this inefficiency, in this paper we propose a novel approach based on potential-based reward shaping using state–space segmentation to decompose the task and to provide more frequent feedback to the agent. Our approach involves extracting state–space segments by formulating the problem as a minimum cut problem on a transition graph, constructed using the agent's experiences during interactions with the environment via the Extended Segmented Q-Cut algorithm. Subsequently, these segments are leveraged in the agent's learning process through potential-based reward shaping. Our experimentation on benchmark problem domains with sparse rewards demonstrated that our proposed method effectively accelerates the agent's learning without compromising computation time while upholding the policy invariance principle.

Keywords: Reinforcement learning; State–space segmentation; Potential-based reward shaping; Reward shaping; Sparse rewards
to sacrifice computation time while preserving the policy invariance principle.

Contributions. We make the following main contributions:

• We propose a novel potential-based reward-shaping mechanism to improve the efficiency of the RL algorithm in sparse reward environments, by using a potential function defined via a graph-theoretic state–space segmentation strategy.
• We extend the Segmented Q-Cut algorithm [1] to identify state–space segments using a minimum cut problem formulation on a transition graph constructed by the agent in an online manner, i.e., during its interaction with the environment.
• We experimentally validate the performance of the proposed reward shaping mechanism on several benchmark sparse reward RL domains. The results show that our reward-shaping formulation indeed improves sample efficiency while preserving policy invariance.

The rest of the paper is organized as follows. Section 2 reviews the related work. The necessary background for a better understanding of the remaining material is given in Section 3. Section 4 covers the proposed method in detail. Experimental results are presented and discussed in Section 5. Finally, Section 6 wraps up with the conclusions and possible future research directions.

2. Related work

Sparse rewards. Sparse reward scenarios, where the agent encounters rewards infrequently or not at all for the majority of its actions, present a formidable challenge for reinforcement learning algorithms. To address this issue, studies have employed diverse strategies, including exploration techniques [6], reward-based approaches [3–5], task decomposition [1], imitation learning [7,8], and expert demonstrations [9]. One prominent avenue of research is reward shaping [10], which modifies the reward signal to provide additional feedback to the agent. Various forms of additional reward signals are used in reward shaping mechanisms, such as exploration bonuses [5,6], belief-based signals [11], or intrinsic motivation-based signals depending on state novelty [12] or curiosity [13]. Our proposed approach falls under the line of reward-based strategies; however, it leverages the identification of subgoals and, hence, state–space segments.

Potential-based reward shaping. A central line of work, potential-based reward shaping (PBRS), is introduced in [14], where an artificial potential function is used to encourage the agent towards desirable states. However, careful design is crucial, as poorly crafted reward shaping can introduce biases or suboptimal behavior [15]. Okudo and Yamada introduce a PBRS approach in which the potentials are defined in terms of subgoal achievements [3]; however, the agent acquires subgoals from human participants instead of identifying them automatically. [9] proposes a potential function using expert demonstrations: a potential is assigned to state–action pairs according to their similarity to those seen in the demonstrations. A recent study [16] provides the agent with a potential-based language reward derived from an action frequency vector; however, their language-based reward function overlooks past states. Furthermore, a distance-based potential function derived from the A* search algorithm is introduced for a specific sparse reward planning task [17]. Potential functions can also be plan-based [18] or based on auxiliary reward functions [19]. Unlike these, our proposed reward shaping mechanism uses a graph-theoretical potential function defined over state–space segments discovered by the agent in an online manner, without requiring any expert or prior knowledge.

Although the utilization of PBRS on single-agent RL problems is common, there are also studies applying PBRS in Multi-Agent Reinforcement Learning settings such as [20,21]. Furthermore, Demir et al. propose a landmark-based reward shaping to advance learning speed and quality [4]. They define the potentials in terms of the value of landmarks; however, they focus on RL problems with hidden states. Conversely, our approach works on fully observable RL problems and uses potentials in terms of the Q-value of the state–space segments.

Subgoal discovery. To accelerate learning, studies based on task decomposition leverage the learning of macro policies defined over identified critical states, referred to as subgoals [2,8,22–24]. While some of these methods use graph-theoretical approaches as we do, others are based on the statistical analysis of agent transitions. Nevertheless, our study diverges from all these methods as it does not necessitate the additional learning of such macro policies.

3. Background

3.1. Reinforcement learning

A sequential decision-making process under uncertainty is the classical representation of a Reinforcement Learning problem, in which actions taken by an agent in a sequence are associated with uncertain outcomes. Formally, an RL problem is generally modeled with a Markov Decision Process (MDP), assuming that the Markov property holds [25]. An MDP is described with the tuple ⟨S, A, T, R, γ⟩, where S is a finite set of states; A is a finite set of actions; T : S × A × S → [0, 1] is a transition function that maps state–action pairs to a probability distribution over states; R : S × A → ℝ is a reward function that provides an immediate reward after an action choice in some state s ∈ S; and γ ∈ [0, 1] is a discount factor that shows the importance of future rewards to the current state.

The RL agent's interaction with the environment starts with observing its state information s_t ∈ S at the decision step t ∈ T, where T can be finite in episodic tasks or ∞ for continuing tasks. Based on the observed state, the agent chooses an action a_t from the admissible action set A. At each time step, the agent chooses actions according to its policy π, defined as π : S × A → [0, 1], the probability distribution over state–action pairs. Whether the agent has chosen a good or bad action is reflected with an immediate reward r_t(s_t, a_t) ∈ ℝ based on the reward function R(s_t, a_t). After receiving feedback from the environment, the agent transitions to a new state s_{t+1} ∈ S according to the transition function T(s_t, a_t, s_{t+1}). This interaction process between the agent and the environment continues until the agent learns to perform the RL task successfully. Therefore, after each feedback signal from the environment, the agent adjusts its behavior with the goal of extracting the best action sequences that enable it to master the RL task.

The goal of the RL agent is formally defined as maximizing its long-term return G̃_t with

G̃_t ≐ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯ ≐ ∑_{k=t+1}^{T} γ^{k−t−1} r_k = r_{t+1} + γ G̃_{t+1},   (1)

where T shows the time at which the episode ends. If the problem is non-episodic, T can be replaced with ∞ in this formula [26].

The value V_π(s) of a state s under a policy π is defined as the expected return starting from state s and following policy π,

V_π(s) ≐ E_π[ G̃_t ∣ s_t = s ],   (2)

whereas the action-value Q_π(s, a) of a state–action pair (s, a) under policy π is defined as the expected return starting from state s, taking action a, and then following policy π,

Q_π(s, a) ≐ E_π[ G̃_t ∣ s_t = s, a_t = a ].   (3)

In well-known strategies, the RL agent aims to discover the optimal policy that provides the maximum expected cumulative future discounted reward by first estimating value or action-value functions that represent the values of the policies, in order to search and evaluate the policies in the policy space.
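The return in (1) is the quantity that every algorithm discussed in the remainder of the paper estimates. As a quick illustration (not part of the original paper), it can be computed backwards from a recorded reward sequence in a few lines of Python:

```python
def discounted_return(rewards, gamma=0.9):
    """Eq. (1): G_t = r_{t+1} + gamma*r_{t+2} + ... , computed via the
    recursion G_t = r_{t+1} + gamma*G_{t+1}."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G
```

For example, discounted_return([0.0, 0.0, 1.0], gamma=0.9) evaluates to 0.81, reflecting how a sparse terminal reward is discounted back to the start of the episode.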
Temporal-difference (TD) learning is the central idea of most RL algorithms and stands on the incremental computation of value estimates. TD approaches differ from Monte Carlo methods in that they bootstrap, i.e., they update the value estimates based on the learned value estimates of successor state–action pairs with the following update rule

Q_{t+1}(s_t, a_t) ← Q_t(s_t, a_t) + α δ_t,   (4)

where δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t) is the one-step TD error for the state–action value after transitioning to the next state s_{t+1} and receiving r_{t+1}. Update (4) is applied in the Sarsa algorithm, a well-known on-policy TD control approach extending from standard one-step TD learning, or TD(0) [27]. The off-policy version of the TD error,

δ_t = r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t),   (5)

is used in the Q-Learning [28] algorithm. An extension of one-step TD that employs multi-step bootstrapping is n-step TD learning, TD(n), in which the bootstrapping considers n-step successor state–action pairs as

Q_{t+n}(s_t, a_t) ← Q_{t+n−1}(s_t, a_t) + α δ_{t:t+n},   (6)

where δ_{t:t+n} = G_{t:t+n} − Q_{t+n−1}(s_t, a_t) is the n-step TD error and G_{t:t+n} ≐ r_{t+1} + γ r_{t+2} + ⋯ + γ^{n−1} r_{t+n} + γ^n Q_{t+n−1}(s_{t+n}, a_{t+n}). Following this, in learning settings involving temporally extended actions referred to as macro actions, a.k.a. options, Macro Q-Learning [29,30] with the update rule

Q(s_t, m) ← Q(s_t, m) + α [ r + γ^n max_a Q(s_{t+n}, a) − Q(s_t, m) ],   (7)

where

r = r_{t+1} + γ r_{t+2} + ⋯ + γ^{n−1} r_{t+n},   (8)

is employed. Here, the approximate action value is updated after a multi-step transition from state s_t to s_{t+n} using macro action m.

TD methods are faster due to online learning, but less stable because of their sensitivity to initial estimates.

3.2. Reward shaping

Without frequent reinforcement signals to update its behavior, the RL agent struggles to discover the optimal policy in sparse and/or delayed reward environments. To handle this, reward shaping mechanisms modify the reward function so that denser feedback signals are provided to the agent [10].

A reward shaping mechanism transforms the original MDP M = ⟨S, A, T, R, γ⟩ into a modified MDP M′ = ⟨S, A, T, R′, γ⟩, where R′ = R + F is the transformed reward function and F : S × A → ℝ represents the shaping reward function [14,31]. The shaping function F encourages the agent to improve its knowledge of the environment and adjust its behavior accordingly.

The shaping reward function F can be defined in different forms based on the goal of the shaping. For instance, a shaping function defined over state–action pairs may produce a bonus term to encourage the agent to explore novel state–action pairs [32]. The agent receives a bonus term, represented with B_t(s, a), over its extrinsic reward signal r_t(s, a) after visiting the state–action pair (s, a) at time step t. Hence, the final reward value for visiting the (s, a) pair at time step t becomes

r̃_t(s, a) = r_t(s, a) ⊕ B_t(s, a),   (9)

where ⊕ is the aggregation operation on the extrinsic reward and the bonus term [33].

Moreover, in the potential-based version of reward shaping, the function F is defined using some potential function of states, which is explained in the following subsection in detail.

3.3. Potential-based reward shaping

Potential-based reward shaping (PBRS), introduced by Ng et al. [14], is a way to shape rewards to deal with the sparse reward function R while preserving policy invariance. The dense reward function R′ is obtained with additional reward signals in the form of potentials provided to the agent without changing the optimal policy for the underlying MDP, a.k.a. the policy invariance principle.

In PBRS, the shaping reward function, denoted with F, F : S × S → ℝ, is defined as the difference between the real-valued potentials of the successive states for a state transition. The potential of a state, represented by the function Φ, Φ : S → ℝ, expresses some knowledge of the environment as a real value given the state information.

Formally, the potential-based shaping function F(s, s′) for a state transition s → s′ is defined by

F(s, s′) = γ Φ(s′) − Φ(s),   (10)

where γ is the discount factor.

Ng et al. prove that if F is a potential-based reward shaping function as defined in (10), then every optimal policy for the modified MDP M′ is also an optimal policy in the original MDP M and vice versa [14]. Hence, defining F in the form given in (10) is sufficient to guarantee policy invariance. Furthermore, the definitions of both the shaping and the potential functions are extended by [9] to include actions while still satisfying the policy invariance principle.

3.4. Segmented Q-Cut algorithm

An expedient strategy to accelerate the learning process in RL involves task decomposition. Task-decomposition methods require the identification of subgoals or bottlenecks to divide the task into smaller sub-tasks and to learn useful skills to successfully complete each sub-task. A significant line of work identifies subgoals as cuts on the agent's transition graph. A transition graph, in the RL context, is a special kind of graph structure that stores the state transitions in an MDP. Let G be a transition graph defined as G = ⟨N, A⟩. G is a capacitated directed network in which the nodes (N) denote the states, whereas the arcs (A) denote the state transitions. To illustrate, the transition from state s → s′ is reflected in the graph G with arc (s, s′) ∈ A. The arc capacity definition depends on the RL task at hand. Previous studies that aim to find subgoal states mostly use state-visitation-frequency-based arc capacities [1]. A minimum cut of the transition graph G is a set of arcs whose removal from G divides it into two disjoint subgraphs and that is minimal with respect to some measure, e.g., the capacity of the cut arcs.

A fundamental work belonging to this line, Q-Cut, identifies subgoals by solving the Max-Flow Min-Cut problem on the transition graph of the agent's experiences in the MDP and finds cuts of the entire state space [1], as provided in Algorithm 1. Moreover, the LCut method focuses on local transition graphs and uses a partitioning algorithm to extract local cuts [23]. Both studies suggest generating macro-actions in the form of an options framework for agents to learn skills to reach those subgoals [2].

Menache et al. also propose the automatic identification of subgoals (bottlenecks) in a dynamic environment with the Segmented Q-Cut algorithm [1]. The Segmented Q-Cut is an extension of the Q-Cut algorithm in which the general idea is to apply cuts on the state segments recursively. In this manner, Segmented Q-Cut outperforms the Q-Cut algorithm at finding multiple subgoals in a given problem. A subgoal or bottleneck is defined as a border state of strongly connected areas of the transition history graph of the agent. As specified before, the transition graph holds the agent's experiences as state transitions in a directed and capacitated graph. The capacity of the arcs in the transition graph is defined with the relative frequency measure. The relative frequency c_rf is computed by

c_rf = n(i → j) / n(i),   (11)
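To make the background concrete, the following minimal Python sketch (illustrative only, not the authors' implementation) spells out the off-policy TD error of (5) inside the tabular update of (4), together with the generic potential-based shaping term of (10). Q is assumed to be a NumPy array indexed by state and action, and phi an arbitrary potential function supplied by the caller.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular off-policy TD(0) update, Eqs. (4)-(5)."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]   # Eq. (5)
    Q[s, a] += alpha * td_error                          # Eq. (4)
    return Q

def pbrs_shaped_reward(r, s, s_next, phi, gamma=0.9):
    """Potential-based reward shaping, Eq. (10): r~ = r + gamma*phi(s') - phi(s)."""
    return r + gamma * phi(s_next) - phi(s)
```

Replacing r with pbrs_shaped_reward(r, s, s_next, phi) inside q_learning_update is exactly the form of shaping for which the policy-invariance result of Ng et al. [14] holds.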
The Q-Segmenter agent has a Segmenter component whose main role is to compute the state–space segments using the subgoals identified by its inner component, called the Subgoal-Identifier, and then convey the extracted segment information to the Q-Segmenter agent. Moreover, the Segmenter behaves as a bridge between the Q-Segmenter agent and the Subgoal-Identifier component by transferring the transition experience of the Q-Segmenter, denoted as a tuple ⟨s, a, s′⟩, to the Subgoal-Identifier component.

The innermost element of this model is the Subgoal-Identifier, whose main task is to first translate the transition experiences of the Q-Segmenter into a transition graph up to a specific period and then extract the subgoals by performing the ESegQ-Cut algorithm on the final transition graph. The tasks of this component also include transferring the information on identified subgoals to the outer Segmenter component.

The pseudocode for the method is given in Algorithm 4. The algorithm takes the necessary inputs for applying the Q-Learning [28] approach along with the inputs related to extracting state–space segment information, such as the number of episodes for performing the random walk and the cut-quality threshold. The method starts by initializing a Q-table Q_SEG for each state–action pair and the required lists to hold the segments and the values of these segments (𝒮 and V, respectively), and then setting the episode number to zero. At the beginning of the initial episode, before the learning of the Q-Segmenter agent starts, Q-Segmenter completes a random walk phase in the environment via the randomWalk() procedure for the sake of identifying state–space segments. The agent obtains the state–space segments after completing the random walk and computes the values of the segments via the getSegmentValues() procedure if any segment is identified. With the knowledge of segments and their values, the Q-Segmenter agent starts the learning phase by applying the Q-Learning algorithm. Compared to the standard Q-Learning algorithm, the Q-Segmenter agent shapes its environmental reward signal r after taking action a in state s. The shaped reward signal is denoted by r̃ and computed with the shapeReward() procedure. Q-Segmenter then uses r̃ in the update rule for its Q-table Q_SEG. Furthermore, the agent also updates the segment values with the updateSegmentValues() method after each episode.

The flowchart of the learning process with potential-based reward shaping using state–space segmentation with the ESegQ-Cut algorithm is presented in Fig. 2. In summary, the learning process starts with the agent's random walk in the environment for a specific period and the accumulation of the transitions on a transition graph. After the random walk is completed, the ESegQ-Cut algorithm is applied to the transition graph, and the state–space segments are constructed using the identified subgoals and the weakly connected components of the transition graph excluding these subgoals. Following this segmentation of the state space, training of the Q-Segmenter agent starts. The agent interacts with the environment and shapes the environmental reward signal using the extracted knowledge on state–space segments. It then updates its Q-value estimates with the Q-Learning update rule. At the end of each episode, the agent updates its knowledge of the state–space segments in terms of their value, which will be utilized in the following episode. Training of Q-Segmenter is completed after the agent performs a pre-determined number of episodes. The details of the methods are given in the following sections.

Algorithm 4 Learning with Potential-based Reward Shaping Using State-Space Segmentation
1: procedure learnWithESegQCut
2:   require: ⟨S, A, T, R⟩, learning rate α ∈ (0, 1], exploration rate ε ∈ (0, 1], discount factor γ ∈ (0, 1], step limit L ≥ 1, cut quality threshold c_q ≥ 1, num. of episodes M ≥ 1, num. of episodes for random walk M_cut ≥ 1
3:   Q_SEG(s, a) ← 0, ∀s ∈ S, a ∈ A
4:   𝒮 ← [ ], V ← [ ]   ⊳ Initialize Segments & Segment Values
5:   for episode ← 0 to M do
6:     if episode is 0 then
7:       𝒮 ← randomWalk(L, M_cut, c_q, 𝒮)
8:       if 𝒮 is not empty then
9:         V ← getSegmentValues()
10:      end if
11:    end if
12:    Initialize s ∈ S
13:    while s is not terminal or L is not reached do
14:      Choose a ← EPSILON-GREEDY(Q_SEG, ε)
15:      Take action a, observe r, s′
16:      r̃ ← shapeReward(r, 𝒮, V)
17:      δ ← r̃ + γ max_{a′} Q_SEG(s′, a′) − Q_SEG(s, a)
18:      Update Q_SEG(s, a) ← Q_SEG(s, a) + αδ
19:      s ← s′
20:    end while
21:    V ← updateSegmentValues()
22:  end for
23:  return Q_SEG
24: end procedure

4.2.1. State–space segmentation

Random walk. In the initial episode of the learning, Q-Segmenter goes through a random walk phase in the environment to identify state–space segments. The general idea of this phase is illustrated in Fig. 3. The phase starts with the interaction between Q-Segmenter and the environment. Through the random walk, the agent generates trajectories, and the collected trajectories from all episodes of the random walk phase are transformed into a transition graph. The transition graph is then given to the Extended Segmented Q-Cut algorithm. At this stage, first the Segmented Q-Cut algorithm is applied and subgoal states are determined in the graph; then, using these subgoals, the segments of the state space are extracted. The random walk phase is finalized by returning the identified segment information to the learning phase.

The random walk, explained in Algorithm 5, continues for M_cut many episodes. Similar to the Q-Learning algorithm, each episode
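As a rough code-level counterpart of Algorithm 4 (a sketch under assumed interfaces, not the authors' code), the learning loop can be organized as follows. Here env is assumed to expose reset() returning a state index and step(a) returning (next_state, reward, done), and the segment-related callables stand in for Algorithms 5, 8 and 9.

```python
import numpy as np

def learn_with_eseg_qcut(env, n_states, n_actions, random_walk, get_segment_values,
                         update_segment_values, shape_reward,
                         episodes=500, alpha=0.1, gamma=0.9, eps=0.1, step_limit=1000):
    """Sketch of Algorithm 4: one-off segmentation, then shaped tabular Q-Learning."""
    Q = np.zeros((n_states, n_actions))
    segments = random_walk()                                   # segmentation phase (Algorithm 5)
    seg_values = get_segment_values(segments, Q) if segments else {}
    for _ in range(episodes):
        s = env.reset()
        for _ in range(step_limit):
            # epsilon-greedy action selection on the current Q-table
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            r_shaped = shape_reward(r, s, s_next, seg_values)  # potential-based shaping
            Q[s, a] += alpha * (r_shaped + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
            if done:
                break
        seg_values = update_segment_values(seg_values, segments, Q)  # Algorithm 9
    return Q
```

The only changes relative to plain Q-Learning are the shaping call inside the step loop and the per-episode refresh of the segment values.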
of the random walk ends when the terminal (goal) state is found or the step limit (L) is reached. At each time step in an episode, the agent chooses a random action a ∈ A at a state s and observes its consequences as a reward signal r and the next state s′. The tuple ⟨s, a, s′⟩ experienced by the agent is added to the transition graph G. Each time the transition ⟨s, a, s′⟩ is added to the graph G, the frequency of observing the transition from s to s′ is incremented by one. Moreover, the frequency of visiting state s is also incremented by one. Then, since the arc capacity is defined in terms of the relative frequency given in (11), the capacity of the arc (s, s′) that reflects the transition s → s′ is adjusted accordingly using these updated frequencies. After M_cut episodes are completed in the random walk, the transition graph has accumulated the agent's experiences. The graph is then given to the ESegQCut() method as an input, which first computes the subgoals and then extracts the segments of the state space. Finally, the extracted state–space segmentation information is returned to the Q-Segmenter agent for use in the learning process.

Fig. 2. The flowchart of the learning process with the proposed method.

Extended segmented Q-cut. The main idea of the ESegQ-Cut method is to identify the subgoals with the Segmented Q-Cut algorithm and then extract the state–space segments using the identified subgoals, as can be seen in Algorithm 6.

The procedure starts with appending all the nodes in the transition graph G to the list of segments (𝒮) and initializing the required lists to store subgoal-related information. In addition, the source and sink nodes are also determined at this point. Since the agent may start at a different state in each episode of the random walk, the source node is selected as a dummy node added to the graph. Similarly, each episode of the random walk may terminate at the goal state or at an arbitrary state depending on the step limit L. Therefore, the sink node is selected as the goal state if it exists in the graph G; otherwise, a dummy goal state is chosen.

The procedure continues with the application of the Segmented Q-Cut algorithm with the Cut() method explained in Algorithm 7. While we can identify good-quality cut point(s) on the graph, which is denoted by the boolean variable divide, we divide the graph into two segments and proceed with each created segment. The quality of a cut is expressed with the ratio-cut bi-partitioning metric as defined in (12). Since good-quality cuts should have a small number of arcs while separating significant, balanced areas, only the cuts having a quality level greater than the pre-determined threshold c_q are acceptable. Thus, the Segmented Q-Cut method runs until no new cut that satisfies the quality condition is identified. At the end of this method, subgoals are detected as the border states at the cuts.

The extension of the Segmented Q-Cut method starts after obtaining the subgoals. To extract the state–space segments, we first search for adjacent subgoals in the identified subgoals list (X). We group the adjacent subgoals, and then, for each group, we choose the subgoal having minimum degree and append it to the selected subgoals list (X_S). We select the subgoal with the minimum degree within the group because it adheres to the definition of a significant, i.e., high-quality, cut. As explained in Menache et al.'s study [1], a significant source–sink cut (s–t cut) is a cut with a small number of arcs that creates enough states both in the source segment N_s and the sink segment N_t. To this end, we aim to obtain only significant subgoals with this extension.

To move from subgoals to state–space segments, the nodes denoting the selected significant subgoals are removed from the transition graph G. With the Union-Find algorithm, the weakly connected components of the resulting graph are detected. Each weakly connected component is treated as a new segment and added to 𝒮. Finally, each identified subgoal in X_S is also added to 𝒮 as a separate segment. The procedure terminates after the extracted segments 𝒮 are returned.
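Assuming a graph library such as networkx is available, the segmentation step just described can be sketched as follows. This is a simplified, single-level illustration rather than the full recursive ESegQ-Cut with its cut-quality threshold and minimum-degree subgoal selection: the transition graph is built with relative-frequency capacities (Eq. (11)), a minimum s–t cut marks border states as subgoals, and the weakly connected components of the graph without those subgoals become the segments.

```python
from collections import Counter
import networkx as nx

def build_transition_graph(transitions):
    """Build a capacitated digraph from (s, a, s') tuples with relative-frequency
    arc capacities as in Eq. (11): c_rf = n(i -> j) / n(i)."""
    pair_counts = Counter((s, s_next) for s, _, s_next in transitions)
    state_counts = Counter(s for s, _, _ in transitions)
    G = nx.DiGraph()
    for (s, s_next), n_ij in pair_counts.items():
        G.add_edge(s, s_next, capacity=n_ij / state_counts[s])
    return G

def extract_segments(G, source, sink):
    """Single s-t minimum cut; cut-border states on the sink side act as subgoals,
    and the weakly connected components of the remaining graph form the segments."""
    _, (N_s, N_t) = nx.minimum_cut(G, source, sink, capacity="capacity")
    subgoals = {v for u, v in G.edges if u in N_s and v in N_t}
    H = G.copy()
    H.remove_nodes_from(subgoals)
    segments = [set(c) for c in nx.weakly_connected_components(H)]
    segments += [{g} for g in subgoals]   # each subgoal becomes its own segment
    return segments, subgoals
```

In the full method, this cut step is applied recursively to each accepted segment until no cut passes the quality threshold of Algorithm 7.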
4.2.2. Reward shaping based on state–space segmentation

Upon the completion of the random walk procedure, the Q-Segmenter agent has gained knowledge of the environment through the extracted state–space segments. The agent should then benefit from this knowledge in the learning phase. However, the question is: how should the agent utilize the state–space segment information in its learning process to speed up the learning? The proposed method suggests applying potential-based reward shaping using segment information in terms of values. The basic idea is to compute the values of the segments and reflect the potential of the states in terms of segment values. Depending on the potential, the environmental reward signals are shaped and utilized in the Q-value estimations. An illustration that summarizes the learning phase is given in Fig. 4. In the following sections, a potential-based reward-shaping strategy depending on the value of the segments is presented.

Value of the segments. The value of a segment i, denoted by v_{seg_i}, i ∈ 𝒮, is defined as the expected return starting from segment i and is computed as the average Q-value over all states in segment i and all possible actions:

v_{seg_i} = ( ∑_{s∈i} ( ∑_{a∈A} Q(s, a) / |A| ) ) / |i|,   (13)

where Q(s, a) ≐ E[ G̃_t ∣ s_t = s, a_t = a ] is the expected return starting from state s and taking action a, |A| represents the size of the action space, and |i| denotes the size of segment i, i.e., the number of states that belong to segment i.

The average state–action value has been shown to be an effective Q-value update rule, specifically in swarm reinforcement learning [34], as it captures the diversity of Q-values among agents' individual learning. Motivated by the insight from [34], we capture the diversity of Q-values belonging to a segment via the segment value definition (13). In Section 5.4.1, however, we present an analysis where we explore a different segment value definition based on optimistic initialization [26].

If any segment is found by the completion of the random walk procedure, the agent computes the segment values with the getSegmentValues procedure, as given in Algorithm 8, using the rule (13). Since the Q-values for each state–action pair are initialized to zero at the beginning of the first episode, the initial values of the segments are also zero. However, as the Q-values are updated during the learning process, the segment values should change accordingly. By doing this, the agent will be able to determine which segment is more useful for reaching the goal state. Therefore, Q-Segmenter updates the segment values at the end of each episode with the updateSegmentValues method. As explained in more detail in Algorithm 9, this method updates the segment values with the update rule

v^{n+1}_{seg_i} ← v^{n}_{seg_i} + α ( γ v^{n+1}_{seg_i} − v^{n}_{seg_i} ),   (14)

where α ∈ (0, 1] is the learning rate, γ ∈ (0, 1] is the discount factor, and n ∈ [0, M] denotes the episode number; here the v^{n+1}_{seg_i} on the right-hand side is the freshly recomputed value from (13), as in Algorithm 9.
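A direct transcription of (13) and (14), again as a sketch that assumes the tabular Q is a NumPy array and the segments are given as a dictionary from segment id to its states, looks as follows:

```python
import numpy as np

def get_segment_values(segments, Q):
    """Eq. (13) / Algorithm 8: segment value = mean over its states of the
    per-state mean Q-value over all actions."""
    return {i: float(np.mean([Q[s].mean() for s in states]))
            for i, states in segments.items()}

def update_segment_values(seg_values, segments, Q, alpha=0.1, gamma=0.9):
    """Eq. (14) / Algorithm 9: move each stored segment value toward gamma
    times the freshly recomputed value from Eq. (13)."""
    fresh = get_segment_values(segments, Q)
    return {i: v + alpha * (gamma * fresh[i] - v) for i, v in seg_values.items()}
```

These two helpers are exactly the callables assumed by the training-loop sketch given after Algorithm 4.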
Potential-based reward shaping using values of the segments. After the initial computation of segment values, Q-Segmenter starts interacting with the environment by applying the Q-Learning algorithm. However, it shapes the reward signals received from the environment with potential-based reward shaping by means of the state–space segmentation and uses the shaped rewards to update its Q-value estimates for state–action pairs.

The potential of a state s, denoted with the function Φ(s), Φ : S → ℝ, is defined by

Φ(s) = ∑_{i∈𝒮} 1_{s∈i} v_{seg_i},   (15)

where 𝒮 is the set of all segments, 1_{s∈i} is the indicator function that takes the value 1 if state s is in segment i and 0 otherwise, and v_{seg_i} represents the value of segment i. A state can be an element of only one segment. From this, we can conclude that the potential of a state is the value of the segment to which the state belongs.

We define the potential-based reward shaping function F, F : S × S → ℝ, in terms of the difference between the potentials of states s and s′ for the transition s → s′ as

F(s, s′) = γ Φ(s′) − Φ(s) = γ ∑_{i∈𝒮} 1_{s′∈i} v_{seg_i} − ∑_{i∈𝒮} 1_{s∈i} v_{seg_i} = γ v_{seg_k} − v_{seg_j},   (16)

where s′ ∈ segment k, s ∈ segment j, and k, j ∈ 𝒮.

Theorem 1. Let the MDP M = ⟨S, A, T, R, γ⟩ and the transformed MDP M′ be defined as M′ = ⟨S, A, T, R′, γ⟩, where R′ = R + F and the shaping reward function F : S × S → ℝ is defined as in (16). The potential-based reward shaping function F defined in (16) preserves policy invariance, i.e., every optimal policy in M′ is also an optimal policy in M (and vice versa).

Proof. The optimal Q-function for the original MDP M, denoted as Q*_M, satisfies the Bellman optimality equation:

Q*_M(s, a) = E_{s′}[ R(s, a) + γ max_{a′∈A} Q*_M(s′, a′) ].   (17)

When we subtract Φ(s) from both sides,

Q*_M(s, a) − Φ(s) = E_{s′}[ R(s, a) + γ Φ(s′) − Φ(s) + γ max_{a′∈A} ( Q*_M(s′, a′) − Φ(s′) ) ].   (18)

Since Φ(s) = ∑_{i∈𝒮} 1_{s∈i} v_{seg_i}, we get

Q*_M(s, a) − ∑_{i∈𝒮} 1_{s∈i} v_{seg_i} = E_{s′}[ R(s, a) + γ ∑_{i∈𝒮} 1_{s′∈i} v_{seg_i} − ∑_{i∈𝒮} 1_{s∈i} v_{seg_i} + γ max_{a′∈A} ( Q*_M(s′, a′) − ∑_{i∈𝒮} 1_{s′∈i} v_{seg_i} ) ].   (19)

Algorithm 6 Extended Segmented Q-Cut Algorithm
1: procedure ESegQCut
2:   require: Graph G = ⟨N, A⟩, c_q, segments 𝒮
3:   𝒮 ← all nodes in the graph G
4:   source ← dummy node
5:   sink ← node denoting the goal state if it exists in G, otherwise dummy goal
6:   subgoals X ← [ ], subgoal groups X_G ← [ ],
7:   selected subgoals X_S ← [ ]
8:   divide ← True
9:   while divide do
10:    for each segment in 𝒮 do
11:      divide ← divide and Cut(G, c_q, 𝒮, source, sink, X)
12:    end for
13:  end while
14:  X ← getSubgoals()
15:  X_G ← Identify the adjacent subgoals in X, if any
16:  for each group in X_G do
17:    Append the subgoal having min. degree in the group to X_S
18:  end for
19:  for each node n in X_S do
20:    A ← A ∖ {(n, i), (i, n)}, ∀i ∈ N
21:    N ← N ∖ {n}
22:  end for
23:  wcc ← UnionFind(G)   ⊳ weakly connected components of the resulting G
24:  𝒮 ← 𝒮 ∪ {c}, ∀c ∈ wcc
25:  𝒮 ← 𝒮 ∪ {x}, ∀x ∈ X_S
26:  return 𝒮
27: end procedure

Algorithm 7 Cut Procedure of Extended Segmented Q-Cut
1: procedure Cut
2:   require: Graph G = ⟨N, A⟩, c_q, segments 𝒮, source, sink, subgoals X
3:   Subgraph G_sub = ⟨N_sub, A_sub⟩, N_sub ← ∅, A_sub ← ∅
4:   G_sub ← Induced subgraph of the graph G containing the nodes in 𝒮 and the arcs between those nodes
5:   min_cut_value, partitions ← minCut(G_sub, source, sink)
6:   Obtain N_s and N_t from partitions
7:   n_s ← |N_s|   ⊳ size of source segment
8:   n_t ← |N_t|   ⊳ size of sink segment
9:   A_{s,t} ← number of arcs connecting N_s & N_t
10:  quality ← n_s ⋅ n_t / A_{s,t}
11:  if quality > c_q then
12:    Identify new source_s and sink_s for N_s
13:    Identify new source_t and sink_t for N_t
14:    X ← X ∪ {sink_s, source_t}
15:    𝒮 ← {N_s, N_t}
16:    return True   ⊳ another cut is possible
17:  end if
18:  return False
19: end procedure

Algorithm 8 Computation of the Segment Values
1: procedure getSegmentValues
2:   require: segments 𝒮, segment values V, Q_SEG
3:   for each segment i in 𝒮 do
4:     sum ← 0
5:     for each state s in i do
6:       sum ← sum + ( ∑_{a∈A} Q_SEG(s, a) ) / |A|
7:     end for
8:     V_i ← sum / |i|
9:   end for
10:  return V
11: end procedure

Algorithm 9 Update of the Segment Values
1: procedure updateSegmentValues
2:   require: segments 𝒮, segment values V, α, γ
3:   new segment values V′ ← getSegmentValues()
4:   for each segment i in 𝒮 do
5:     Δ_seg ← γ V′_i − V_i
6:     V_i ← V_i + α ⋅ Δ_seg
7:   end for
8:   return V
9: end procedure
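For completeness, the potential of Eq. (15) can be realized as a small lookup built from the extracted segments. The following is a hedged sketch under the same assumed data layout as the earlier snippets (segments and segment values stored in dictionaries); states outside any segment are given potential 0, which is an assumption of the sketch rather than something stated in the paper.

```python
def make_segment_potential(segments, seg_values):
    """Eq. (15): Phi(s) is the value of the (unique) segment containing s."""
    segment_of = {s: i for i, states in segments.items() for s in states}
    def phi(s):
        return seg_values.get(segment_of.get(s), 0.0)  # 0 if s is unsegmented (assumption)
    return phi
```

Plugging this phi into the generic shaping rule of Section 3.3, F(s, s′) = γΦ(s′) − Φ(s), yields exactly the segment-based shaping of Eq. (16) used in the Q_SEG update.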
Fig. 7. Learning performances of proposed method Q-Segmenter, Q-Learning and Macro Q-Learning for Six-Rooms GridWorld domain with a limit of 1000 steps per episode.
Table 3
Overall performance comparison of the proposed method Q-Segmenter with benchmarks.

Problem | Method | Average Steps (stdev) | Average Reward (stdev) | Average Elapsed Time sec (stdev) | Average Elapsed Time (sec) for …

Six-rooms (1000 steps)
  Q-Segmenter       | 696.33 (166.83)   | 444.04 (458.39)   | 0.06 (0.04) | 0.13 (0.11) | 121.26 (45.17) | 20.58 (9.09)  | 100.67 (45.83)
  Q-Learning        | 984.99 (17.94)    | 929.40 (228.14)   | 0.00 (0.00) | 0.01 (0.05) | 93.71 (7.15)   | –             | –
  Macro Q-Learning  | 998.89 (0.95)     | 999.00 (0.00)     | 0.00 (0.01) | 0.00 (0.00) | 54.70 (7.99)   | 1.22 (0.02)   | 53.47 (7.98)

Six-rooms (2000 steps)
  Q-Segmenter       | 198.59 (391.22)   | 43.70 (5.19)      | 0.18 (0.08) | 0.23 (0.03) | 67.45 (6.70)   | 33.47 (1.95)  | 33.98 (6.09)
  Q-Learning        | 741.70 (800.56)   | 46.94 (5.20)      | 0.13 (0.09) | 0.22 (0.02) | 70.48 (14.89)  | –             | –
  Macro Q-Learning  | 1369.72 (150.26)  | 1283.40 (932.08)  | 0.01 (0.04) | 0.00 (0.00) | 63.84 (33.16)  | 2.46 (0.07)   | 61.39 (33.12)

Zigzag-four-rooms (2000 steps)
  Q-Segmenter       | 1808.44 (89.51)   | 1489.18 (806.83)  | 0.01 (0.00) | 0.03 (0.05) | 196.59 (42.78) | 6.94 (3.96)   | 189.64 (46.54)
  Q-Learning        | 1924.65 (104.13)  | 1562.38 (755.86)  | 0.00 (0.00) | 0.02 (0.05) | 186.77 (12.23) | –             | –
  Macro Q-Learning  | 1998.64 (2.46)    | 1999.00 (0.00)    | 0.00 (0.01) | 0.00 (0.00) | 133.5 (12.31)  | 3.77 (0.08)   | 129.73 (12.31)

Zigzag-four-rooms (5000 steps)
  Q-Segmenter       | 186.45 (473.57)   | 74.00 (6.39)      | 0.11 (0.03) | 0.14 (0.01) | 60.93 (4.44)   | 30.34 (0.84)  | 30.58 (4.57)
  Q-Learning        | 1169.62 (1742.68) | 75.74 (5.40)      | 0.09 (0.06) | 0.13 (0.01) | 116.71 (18.78) | –             | –
  Macro Q-Learning  | 4491.01 (56.40)   | 4502.60 (1478.06) | 0.01 (0.07) | 0.00 (0.00) | 214.33 (67.97) | 5.61 (0.19)   | 208.72 (67.94)

Locked Shortcut Six-rooms (3000 steps)
  Q-Segmenter       | 384.69 (669.94)   | 54.00 (2.26)      | 0.14 (0.06) | 0.19 (0.01) | 130.08 (27.19) | 53.66 (4.29)  | 76.42 (28.05)
  Q-Learning        | 1257.07 (1301.60) | 56.00 (3.55)      | 0.10 (0.08) | 0.18 (0.01) | 116.20 (19.80) | –             | –
  Macro Q-Learning  | 2842.97 (39.81)   | 2881.30 (576.61)  | 0.00 (0.02) | 0.00 (0.00) | 221.29 (46.50) | 6.19 (0.19)   | 215.11 (46.48)

Locked Shortcut Six-rooms (5000 steps)
  Q-Segmenter       | 308.39 (672.78)   | 56.30 (3.27)      | 0.14 (0.06) | 0.18 (0.01) | 180.84 (15.43) | 66.19 (2.38)  | 114.64 (15.04)
  Q-Learning        | 840.56 (1566.00)  | 56.70 (4.30)      | 0.01 (0.01) | 0.02 (0.00) | 84.58 (12.40)  | –             | –
  Macro Q-Learning  | 4028.74 (257.89)  | 3813.42 (2109.76) | 0.01 (0.11) | 0.00 (0.00) | 210.85 (93.40) | 6.28 (0.26)   | 204.58 (93.33)
Fig. 8. Learning performances of proposed method Q-Segmenter, Q-Learning and Macro Q-Learning for Six-Rooms GridWorld domain with a limit of 2000 steps per episode.
Fig. 9. Learning performances of proposed method Q-Segmenter, Q-Learning and Macro Q-Learning for Zigzag Four-Rooms GridWorld domain with a limit of 2000 steps per
episode.
Fig. 10. Learning performances of proposed method Q-Segmenter, Q-Learning and Macro Q-Learning for Zigzag Four-Rooms GridWorld domain with a limit of 5000 steps
per episode.
Fig. 11. Learning performances of proposed method Q-Segmenter, Q-Learning and Macro Q-Learning for Locked Shortcut Six-Rooms GridWorld domain with a limit of
3000 steps per episode.
Fig. 12. Learning performances of proposed method Q-Segmenter, Q-Learning and Macro Q-Learning for Locked Shortcut Six-Rooms GridWorld domain with a limit of
5000 steps per episode.
Fig. 13. Segments and cuts on the graph after random walk phase for 25 episodes is completed in Six-Rooms GridWorld domain. Each node on the transition graph represents
a state, whereas arcs are the possible actions taken to reach the pointed states. The white nodes and the white rectangle nodes symbolize source and sink states, respectively.
The nodes with the same color belong to the same state segment. The highlighted green nodes show the identified subgoal states and the accompanied red dashed arcs are the
minimum cuts. As illustrated, the agent identified 5 state–space segments in Six-Rooms GridWorld domain.
5.4. Analysis

In this section, we further analyze the specific design choices of our approach, specifically considering an alternative segment value definition in Section 5.4.1 and different values for critical hyperparameters, such as the cut quality threshold and the number of episodes for the random walk, in Section 5.4.2.

5.4.1. Alternatives for the segment value

Motivated by the average-Q information exchange among swarm agents introduced in [34], we define the value of segments as the average Q-value of all states belonging to the segment, as provided in (13). However, based on the optimistic initialization idea [26], we can also define the value of a segment i optimistically using the rule

v_{seg_i} = max_{s∈i} ( ∑_{a∈A} Q(s, a) / |A| ).   (24)

Here, Q(s, a) ≐ E[ G̃_t ∣ s_t = s, a_t = a ] represents the expected return starting from state s and taking action a, while |A| denotes the size of the action space. This rule assigns the maximum state value within the segment as the segment value, thereby encouraging exploration.
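The optimistic variant in (24) only changes the aggregation inside each segment from a mean to a maximum. A corresponding sketch, under the same assumed data layout as the earlier get_segment_values snippet, is:

```python
import numpy as np

def get_segment_values_optimistic(segments, Q):
    """Eq. (24): take the maximum per-state mean Q-value in each segment
    instead of the average of Eq. (13)."""
    return {i: float(max(Q[s].mean() for s in states))
            for i, states in segments.items()}
```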
Fig. 14. Segments and cuts on the graph after random walk phase for 25 episodes is completed in Locked Shortcut Six-Rooms domain. Each node on the transition graph
represents a state, whereas arcs are the possible actions taken to reach the pointed states. The white nodes and the white rectangle nodes symbolize source and sink states,
respectively. The nodes with the same color belong to the same state segment. The highlighted green nodes show the identified subgoal states and the accompanied red dashed
arcs are the minimum cuts. Based on this, it is clear that the agent identified 6 state–space segments in Locked Shortcut Six-Rooms domain.
Fig. 15. Learning performances of proposed method Q-Segmenter, Q-Segmenter-MaxQ with segment value definition given in (24) and Q-Learning for Six-Rooms GridWorld
domain with a limit of 1000 steps per episode.
Fig. 16. Learning performances of proposed method Q-Segmenter, Q-Segmenter-MaxQ with segment value definition given in (24) and Q-Learning for Six-Rooms GridWorld
domain with a limit of 2000 steps per episode.
We assessed the performance of the optimistic segment value variant of the Q-Segmenter, referred to as Q-Segmenter-MaxQ, in comparison to our own approach within the Six-Rooms GridWorld domain, imposing step limits of 1000 and 2000. We employed the same performance metrics detailed in Section 5.3. As depicted in Figs. 15 and 16, Q-Segmenter-MaxQ, leveraging enhanced exploration coupled with a reward shaping mechanism, exhibits superior performance in the initial learning episodes. However, in line with the observations in [26], the efficacy of optimistic initialization of values is inherently temporary. Moreover, the agent may struggle to accurately identify critical segments, such as those containing the goal state, due to the overestimation of the value of other segments. As the segment values are gradually updated, the impact of this overestimation diminishes. Nevertheless, this correction is insufficient to significantly expedite the learning process, as demonstrated in the results. Conducting a more comprehensive analysis across diverse problem settings and exploring various segment value definitions would provide a more nuanced understanding.

Fig. 17. Learning performances of proposed method Q-Segmenter with different cut quality threshold choices under Six-Rooms GridWorld domain with a limit of 1000 steps per episode.

Fig. 18. Learning performances of proposed method Q-Segmenter with different number of random walk episodes, M_cut, choices under Six-Rooms GridWorld domain with a limit of 1000 steps per episode.

5.4.2. Exploration of parameter choices

Cut quality threshold. The quality of the partitioned state space, as articulated by the metric in (12), is heavily influenced by the chosen cut quality threshold. A desirable outcome involves achieving a balanced partition with a minimal number of arcs in the cut. Considering this, we conducted an exploration of various cut quality thresholds to assess their impact on the learning performance of the agent within the Six-Rooms GridWorld domain with a limit of 1000 steps per episode.

In Fig. 17, we compare the learning performance w.r.t. the number of steps and the average reward measures under different cut quality thresholds. Poor learning performance is observed when employing both low and high values for the cut quality threshold. In the case of a low threshold, the agent tends to recognize cuts involving states that are not highly useful, such as those that are not associated with hallways. Conversely, higher threshold values result in the agent identifying a reduced number of subgoals, leading to crowded segments within the state space. If an initial cut is inaccurately identified, the agent may propagate this error to subsequent segmentations in the cut procedure, primarily due to the stringent cut quality condition. Therefore, a moderate value for this hyperparameter is set for the performance evaluation of our approach.

M_cut: the number of random-walk episodes. In contrast to the cut quality threshold, the number of episodes for the random walk, denoted as M_cut, exerts an implicit influence on the identification of state–space segments through the generated state-transition graph. A higher count of random walk episodes is anticipated to yield a transition graph with an increased number of nodes and arcs. However, this amplified size of the transition graph introduces challenges in accurately identifying high-quality cuts. Consequently, the misidentification of segments may lead to misguided learning on the part of the agent. From a technical point of view, employing a substantial number of random walk episodes may pose a significant bottleneck in real-world scenarios, especially since even larger transition graphs are expected to be constructed by the agent.

For this reason, we investigated the impact of M_cut on the learning performance in the Six-Rooms GridWorld domain while imposing a limit of 1000 steps per episode. As shown in Fig. 18, a notable trend is seen: a reduction in the number of random walk episodes correlates with improved performance. With extended random walk episodes, the agent becomes more susceptible to identifying suboptimal cuts, as the transition graph is likely to expand and the visitation frequency values for randomly explored states tend to increase. This, in turn, has a consequential effect on cut identification. From a technical standpoint, a scenario with a limited number of random walk episodes poses less complexity for cut identification, as we anticipate a relatively smaller state transition graph in such cases.

6. Conclusion

In this study we propose potential-based reward shaping using state–space segmentation with the Extended Segmented Q-Cut algorithm to overcome slow learning in RL problems with a sparse explicit reward structure. In this method, rather than using a reward shaping mechanism to extract a piece of useful information about the environment, we apply potential-based reward shaping that depends on the information about the environment extracted during the learning process of the agent. That is, the reward shaping is performed using the state–space segment information. Our analyses on sparse reward problem settings showed that the proposed method accelerates the learning of the agent while maintaining policy invariance and without the need to prolong the computation time.
In future work, one can consider online identification of state–space segments, i.e., using the Segmenter component periodically rather than identifying segments only once at the end of the random walk phase. In this direction, however, the question of how to combine previously identified segment information with the new ones should be carefully thought out. Moreover, the cut quality threshold can be designed with a more systematic approach instead of being hand-crafted, as it significantly affects the quality of the identified segments. Another research direction can be using a population of agents in the random walk phase of the proposed method, such that the state–space segment information is extracted faster from the combined experiences of the agent population. Finally, it could be valuable to conduct an experimental study on more complex scenarios, such as domains that are beyond grid-world settings and have much larger state and action spaces.

CRediT authorship contribution statement

Melis İlayda Bal: Conceptualization, Software, Writing – original draft. Hüseyin Aydın: Writing – review & editing, Software, Validation. Cem İyigün: Writing – review & editing, Supervision. Faruk Polat: Writing – review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgment

The authors would like to thank Anastasia Akkuzu for proofreading the manuscript.

Appendix A. Notations

The notation used in the paper is given in Table A.4.

Appendix B. Experiment details

Further details about the experimentation of our study (including the parameters of the Macro Q-Learning and Deep-RL models that we tried for the comparisons) are given in Tables B.5 to B.8.

Table B.7
Parameters for the experiments of the DQN model in sample domains.

Parameter | Values
learning_rate | 0.00001, 0.0005, 0.0001, 0.001
gamma (discount factor) | 0.9
epsilon_start (initial expl. rate) | 0.9, 0.5, 0.1
epsilon_end (final expl. rate) | 0.1

Table B.8
Parameters for the experiments of the PPO model in sample domains.

Parameter | Values
learning_rate | 0.0005, 0.0003, 0.0001, 0.001
gamma (discount factor) | 0.9
n_steps (# of steps per update) | 32, 64, 128
clip_range (clipping param.) | 0.1, 0.2, 0.3

References

[1] I. Menache, S. Mannor, N. Shimkin, Q-cut - dynamic discovery of sub-goals in reinforcement learning, in: Proceedings of the 13th European Conference on Machine Learning, ECML '02, Springer-Verlag, Berlin, Heidelberg, 2002, pp. 295–306.
[2] R.S. Sutton, D. Precup, S. Singh, Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning, Artificial Intelligence 112 (1–2) (1999) 181–211, http://dx.doi.org/10.1016/S0004-3702(99)00052-1.
[3] T. Okudo, S. Yamada, Subgoal-based reward shaping to improve efficiency in reinforcement learning, IEEE Access 9 (2021) 97557–97568, http://dx.doi.org/10.1109/ACCESS.2021.3090364.
[4] A. Demir, E. Çilden, F. Polat, Landmark based reward shaping in reinforcement learning with hidden states, in: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '19, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2019, pp. 1922–1924.
[5] M.I. Bal, C. Iyigun, F. Polat, H. Aydin, Population-based exploration in reinforcement learning through repulsive reward shaping using eligibility traces, Ann. Oper. Res. (2024) 1–33.
[6] P. Dayan, T. Sejnowski, Exploration bonuses and dual control, in: Machine Learning, 1996, pp. 5–22.
[7] D. Rengarajan, G. Vaidya, A. Sarvesh, D. Kalathil, S. Shakkottai, Reinforcement learning with sparse rewards using guidance from offline demonstration, 2022, arXiv preprint arXiv:2202.04628.
[8] S. Paul, J. Vanbaar, A. Roy-Chowdhury, Learning from trajectories via subgoal discovery, Adv. Neural Inf. Process. Syst. 32 (2019).
[9] T. Brys, A. Harutyunyan, H.B. Suay, S. Chernova, M.E. Taylor, A. Nowé, Reinforcement learning from demonstration through shaping, in: Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[10] M.J. Mataric, Reward functions for accelerated learning, in: Proceedings of the Eleventh International Conference on Machine Learning, Morgan Kaufmann, 1994, pp. 181–189.
[11] O. Marom, B. Rosman, Belief reward shaping in reinforcement learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, No. 1, 2018, http://dx.doi.org/10.1609/aaai.v32i1.11741, URL https://ojs.aaai.org/index.php/AAAI/article/view/11741.
[12] J. Schmidhuber, A possibility for implementing curiosity and boredom in model-building neural controllers, in: Proceedings of the First International Conference on Simulation of Adaptive Behavior: From Animals to Animats, MIT Press, Cambridge, MA, USA, 1991, pp. 222–227.
[13] D. Pathak, P. Agrawal, A.A. Efros, T. Darrell, Curiosity-driven exploration by self-supervised prediction, in: Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML '17, JMLR.org, 2017, pp. 2778–2787.
[14] A.Y. Ng, D. Harada, S. Russell, Policy invariance under reward transformations: Theory and application to reward shaping, in: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann, 1999, pp. 278–287.
[15] E. Wiewiora, G. Cottrell, C. Elkan, Principled methods for advising reinforcement learning agents, in: Proceedings of the Twentieth International Conference on Machine Learning, ICML '03, AAAI Press, 2003, pp. 792–799.
[16] P. Goyal, S. Niekum, R.J. Mooney, Using natural language for reward shaping in reinforcement learning, 2019, arXiv preprint arXiv:1903.02020.
[17] Z. Yang, M. Preuss, A. Plaat, Potential-based reward shaping in Sokoban, 2021, arXiv preprint arXiv:2109.05022.
[18] M. Grzes, D. Kudenko, Plan-based reward shaping for reinforcement learning, Knowl. Eng. Rev. 31 (2008) 10–22, http://dx.doi.org/10.1109/IS.2008.4670492.
[19] A. Harutyunyan, S. Devlin, P. Vrancx, A. Nowe, Expressing arbitrary reward functions as potential-based advice, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29, No. 1, 2015, http://dx.doi.org/10.1609/aaai.v29i1.9628.
[20] S. Devlin, D. Kudenko, Plan-based reward shaping for multi-agent reinforcement learning, Knowl. Eng. Rev. 31 (1) (2016) 44–58.
[21] B. Huang, Y. Jin, Reward shaping in multiagent reinforcement learning for self-organizing systems in assembly tasks, Adv. Eng. Inform. 54 (2022) 101800.
[22] Ö. Şimşek, A.G. Barto, Using relative novelty to identify useful temporal abstractions in reinforcement learning, in: Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, ACM, 2004, pp. 95–102.
[23] Ö. Şimşek, A.P. Wolfe, A.G. Barto, Identifying useful subgoals in reinforcement learning by local graph partitioning, in: Proceedings of the 22nd International Conference on Machine Learning, ICML '05, Association for Computing Machinery, New York, NY, USA, 2005, pp. 816–823, http://dx.doi.org/10.1145/1102351.1102454.
[24] Ö. Şimşek, Behavioral Building Blocks for Autonomous Agents: Description, Identification, and Learning (Ph.D. thesis), University of Massachusetts Amherst, 2008.
[25] M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, first ed., John Wiley & Sons, Inc., USA, 1994.
[26] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, A Bradford Book, Cambridge, MA, USA, 2018.
[27] G.A. Rummery, M. Niranjan, On-Line Q-Learning Using Connectionist Systems, Tech. Rep., 1994.
[28] C.J.C.H. Watkins, Learning from Delayed Rewards (Ph.D. thesis), King's College, Cambridge, UK, 1989.
[29] A. McGovern, R.S. Sutton, A.H. Fagg, Roles of macro-actions in accelerating reinforcement learning, in: Grace Hopper Celebration of Women in Computing, Vol. 1317, 1997, p. 15.
[30] A. McGovern, R.S. Sutton, Macro-actions in reinforcement learning: An empirical analysis, Comput. Sci. Depart. Fac. Publ. Ser. (1998) 15.
[31] A.D. Laud, Theory and Application of Reward Shaping in Reinforcement Learning (Ph.D. thesis), University of Illinois at Urbana-Champaign, 2004.
[32] R.S. Sutton, Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, in: Proceedings of the Seventh International Conference on Machine Learning, Morgan Kaufmann, 1990, pp. 216–224.
[33] S. Amin, M. Gomrokchi, H. Satija, H. van Hoof, D. Precup, A survey of exploration methods in reinforcement learning, 2021, arXiv preprint arXiv:2109.00157.
[34] H. Iima, Y. Kuroe, Swarm reinforcement learning algorithms based on Sarsa method, in: 2008 SICE Annual Conference, IEEE, 2008, pp. 2045–2049.
[35] A. Demir, E. Çilden, F. Polat, Automatic landmark discovery for learning agents under partial observability, Knowl. Eng. Rev. 34 (2019) e11.
[36] H. Aydın, E. Çilden, F. Polat, Using chains of bottleneck transitions to decompose and solve reinforcement learning tasks with hidden states, Future Gener. Comput. Syst. 133 (2022) 153–168.
[37] V. Mnih, A.P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1928–1937.
[38] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep reinforcement learning, 2013, arXiv preprint arXiv:1312.5602.
[39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, 2017, arXiv preprint arXiv:1707.06347.
[40] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable-Baselines3: Reliable reinforcement learning implementations, J. Mach. Learn. Res. 22 (268) (2021) 1–8, URL http://jmlr.org/papers/v22/20-1364.html.
[41] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, D. Meger, Deep reinforcement learning that matters, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, 2018.
[42] A. Demir, E. Çilden, F. Polat, Landmark based guidance for reinforcement learning agents under partial observability, Int. J. Mach. Learn. Cybern. (2022) 1–21.
[43] A. Demir, Learning what to memorize: Using intrinsic motivation to form useful memory in partially observable reinforcement learning, Appl. Intell. (2023) 1–19.

Melis Ilayda Bal is a Doctoral Researcher at the Max Planck Institute for Intelligent Systems (MPI-IS), Tübingen. She received her B.S. in Industrial Engineering with a minor in Computer Engineering and her M.S. in Operations Research from the Middle East Technical University (METU), Ankara, in 2019 and 2022, respectively. She is currently a Ph.D. student in Computer Science in the CS@maxplanck doctoral program, and her research interests include reinforcement learning, multi-agent systems, and game theory.

Hüseyin Aydın is a Teaching and Research Assistant at the Department of Computer Engineering at the Middle East Technical University, Ankara. He is also a visiting researcher at the Department of Information and Computing Science at Utrecht University for his post-doctoral studies, funded by the Scientific and Technological Research Council of Turkey (TUBITAK). He received B.S., M.S., and Ph.D. degrees from the same department in 2015, 2017, and 2022, respectively. His research interests include reinforcement learning, multi-agent systems, artificial intelligence, and hybrid intelligence.

Cem İyigün is a Professor in the Department of Industrial Engineering at the Middle East Technical University (METU). Prior to joining METU in 2009, he worked as a lecturer and visiting assistant professor in the Management Science and Information Systems department at Rutgers Business School. He received his Ph.D. degree in 2007 from the Rutgers Center for Operations Research (RUTCOR) at Rutgers University. His research interests lie primarily in data analysis, data mining (specifically clustering problems), time series forecasting, and their applications to bioinformatics, climatology, and electricity load forecasting.

Faruk Polat is Professor of Computer Science at the Department of Computer Engineering at the Middle East Technical University, Ankara. Dr. Polat received a B.S. degree in Computer Engineering from METU in 1987. He received M.S. and Ph.D. degrees in Computer Engineering from Bilkent University in 1989 and 1994, respectively. He was a visiting scholar at the Department of Computer Science at the University of Minnesota, Minneapolis, between 1992 and 1993. His research focuses primarily on artificial intelligence, autonomous agents, and multi-agent systems.