Hierarchical Reinforcement Learning in Multi-Domain Elastic Optical Networks to Realize Joint RMSA
Abstract—To improve the network scalability, a large elastic optical network is typically segmented into multiple autonomous domains, where each domain possesses high autonomy and privacy. This architecture is referred to as the multi-domain elastic optical network (MDEON). In the MDEON, the routing, modulation, and spectrum allocation (RMSA) for inter-domain service requests is challenging. As a result, deep reinforcement learning (DRL) has recently been introduced, where the RMSA policies are learned during the interaction of the DRL agents with the MDEON environment. Due to the autonomy of each domain in the MDEON, joint RMSA is essential to improve the overall performance. To realize the joint RMSA, we propose a hierarchical reinforcement learning (HRL) framework which consists of a high-level DRL module and multiple low-level DRL modules (one for each domain), with collaboration among the DRL modules. More specifically, for inter-domain service requests, the high-level module obtains some abstracted information from the low-level DRL modules and generates the inter-domain RMSA decision for the low-level modules. Then each low-level DRL module makes its intra-domain RMSA decision and feeds the result back to the high-level module. The proposed HRL framework preserves the autonomy of each single domain while delivering effective overall network performance through the cooperation of the high-level and low-level DRL modules. Simulation results demonstrate that our proposed method outperforms previous approaches.

Index Terms—Routing, modulation and spectrum allocation, multi-domain elastic optical network, hierarchical reinforcement learning.

Manuscript received 11 October 2022; revised 14 December 2022; accepted 27 December 2022. Date of publication 9 January 2023; date of current version 17 April 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 62006084 and in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2020A1515111110. (Corresponding author: Yue-Cai Huang.) Liufei Xu is with the School of Physics and Telecommunication Engineering, South China Normal University, Guangzhou 510006, China (e-mail: [email protected]). Yue-Cai Huang, Yun Xue, and Xiaohui Hu are with the School of Electronics and Information Engineering, South China Normal University, Foshan 528200, China (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/JLT.2023.3235039

I. INTRODUCTION

The rising demand for new network services, such as cloud computing, video conferencing and live broadcast, poses great challenges to the bearer communication networks [1]. Compared with conventional wavelength division multiplexing (WDM) networks, elastic optical networks (EONs) possess high flexibility and efficiency in spectrum utilization and support larger-capacity transmissions. Therefore, the EON is considered to be a competitive candidate for the next-generation backbone infrastructure [2], [3], [4]. In practical situations, communication nodes in the network are geographically distributed and/or managed by different network operators [5]. In order to improve the network scalability, a large optical network is typically segmented into multiple autonomous domains, where each domain possesses high autonomy and privacy. It is generally referred to as a multi-domain elastic optical network (MDEON) [6], [7], [8], [9], [10].

The network resource allocation in optical networks can be described as the routing, modulation and spectrum allocation (RMSA) problem. Even for a single-domain EON, the RMSA problem is known to be NP-hard [11]. As a result, conventional studies have focused on the design of various heuristic algorithms aiming to improve the overall network performance [12], [13]. Classical algorithms include K-shortest-path routing [13], distance-aware modulation [14], and first-fit spectrum allocation [15]. These heuristic solutions are based on handcrafted rules that are supposed to lead to good performance. However, the relationship between the rules and the final performance is indirect, and therefore it is difficult to find the optimal solution, especially in the complicated EON environment. In recent years, deep reinforcement learning (DRL) has been applied in dynamic elastic optical networks to tackle the RMSA problem [16], [17], [18], [19]. Different from the heuristic solutions, the DRL-based approaches learn directly from the feedback of the network performance. As a result, DRL has been recognized as a promising method to approach the optimal resource allocation strategy in the complicated EON environment. DRL-based approaches have been demonstrated to outperform classical heuristic algorithms (e.g., K-shortest-path routing and first-fit spectrum allocation), and various studies have been done to explore their advantages [20], [21], [22]. By far, most DRL-based RMSA studies have been applied to single-domain scenarios, where a controller is aware of the whole network conditions.

As mentioned earlier, the MDEON should be considered in practical situations [23]. In the MDEON, to ensure the autonomy of each domain, the domain controller does not disclose detailed intra-domain information to other domains. So the coordination
between the domain controllers is challenging for inter-domain service requests [8], [24]. Recently, using DRL to solve the service-request configuration problem of the MDEON has begun to attract researchers' interest [25], [26]. Li et al. [25] proposed DeepCoop, where each domain is autonomously managed by one DRL agent. For inter-domain service requests, DeepCoop first uses a domain-level path computation element (PCE) to obtain the sequence of domains that the request traverses, and then the agents take charge of the intra-domain resource provisioning. During the provisioning, adjacent agents exchange information and cooperate with each other. This work extends DRL-based RMSA to multi-domain cases. However, the domain-level PCE is still based on heuristic rules, and the cooperation is confined to adjacent domains. To further exploit the advantages of DRL in the MDEON, we propose here a hierarchical reinforcement learning framework for the multi-domain elastic optical network to realize joint RMSA.

The proposed hierarchical reinforcement learning (HRL) framework consists of a high-level DRL module and multiple low-level DRL modules, one for each domain. The low-level modules take charge of intra-domain service requests in the corresponding domains. The high-level module and the low-level modules cooperatively handle the inter-domain service requests. Specifically, the high-level module does not have detailed information about each domain, but it obtains some abstracted information from the low-level DRL modules. Based on this limited information, the high-level module determines a sequence of domains that the inter-domain service request goes through. This sequence is given to the low-level DRL modules. The low-level DRL modules determine the intra-domain resource allocation strategies and return the provisioning results to the high-level DRL module. The proposed HRL framework preserves the autonomy of each single domain while delivering effective overall network performance through the cooperation of the high-level and low-level DRL modules.

We have conducted extensive simulations on the MDEON to evaluate our proposal, and its effectiveness is demonstrated in the following three aspects. (1) The proposed approach outperforms classical heuristic algorithms and state-of-the-art DRL-based approaches. (2) Cooperation between the high-level and low-level modules improves the overall performance. (3) The proposed approach possesses good universality under different network topologies and different traffic loads.

The rest of this paper is organized as follows. Section II surveys the related work. Section III describes the problem formulation of the RMSA problem in the MDEON. The architecture of our proposed approach is presented in Section IV. Section V compares the performance of our proposed approach with other state-of-the-art methods, and finally, Section VI concludes this paper.

II. RELATED WORK

A. Heuristic RMSA Strategies

In the past decade, research has been done on the routing, modulation and spectrum allocation in MDEONs. Most of the proposals are based on a hierarchical control architecture. Casellas et al. [27] introduced the IDEALIST project, which includes the design and implementation of a hierarchical control plane for the multi-domain EON, while detailed RMSA solutions are not exploited. Based on the hierarchical control architecture, various heuristic RMSA strategies have been proposed. Chen et al. [28] demonstrated a multi-domain EON testbed in which the RMSA is based on conventional shortest-path routing and random-fit spectrum assignment. Hirota et al. [23] defined a score for multiple routes based on the number of available frequency slots and the number of fibers, and then chose the route with the maximum score. Chen et al. [29] introduced a spectral cut to measure the level of spectrum fragmentation and proposed a fragmentation-aware RSA that chooses the path with the minimum spectral cut. Zhu et al. [30] assigned weights to measure the spectrum usage and power consumption of the RMSA decisions and tried to find the one with the minimum weight. Le et al. [31] checked the routes following the order of a breadth-first search until the QoT and fragmentation requirements are satisfied. Batham et al. [32] defined a so-called holding pathlength domain factor (HPDF) based on three parameters: the holding time of the request, the path length of the route, and the number of domains across the route; traffic requests are then served in ascending order of HPDF. In general, the idea of the above heuristic RMSA strategies is to first define some metrics and then optimize the problem based on these metrics. However, the relationship between these user-defined metrics and the final performance is indirect, and therefore it is difficult to find the optimal solution, especially in the complicated MDEON environment.

B. Machine Learning-Based RMSA

Machine learning has been exploited in the RMSA of the MDEON recently. Chen et al. [10], [33] proposed a distributed collaborative learning approach for the MDEON. It calculated K shortest paths and selected the QoT-guaranteed one by jointly minimizing the current resource utilization and the predicted traffic on all traversed inter-domain links. The traffic prediction and QoT estimation were done by machine learning. Similar to the heuristic methods, the relationship between the metrics of interest, such as the QoT, and the final performance is indirect. As a result, the optimization was done in an indirect way. Li et al. [34] introduced DeepMDR, a deep learning-assisted control plane system that realizes scalable path computation in multi-domain packet networks. It first collected training samples whose routes are obtained with a genetic algorithm and then used these samples to train a neural network to generate routes for new requests. Zhong et al. [26] proposed a control plane architecture that enables AI-based routing. It learns from sparse historical route trajectories and trains a deep-learning model that can directly return a feasible inter-domain route when requested. In [26], [34], RMSA decisions are given by neural networks, while the training samples are obtained from user-generated data or historical data. With such supervised learning, the learned model can only fit the given data, instead of optimizing the performance.
Fig. 3. Example on constructing the virtual topology (fiber distance in kilometers).

A. Framework Overview

The overall architecture of the HRL is shown in Fig. 2, which utilizes the hierarchical control plane for inter-domain service provisioning. Each domain has a low-level agent which interacts with its corresponding domain environment and makes intra-domain RMSA actions. All low-level agents communicate with a shared high-level agent. Meanwhile, as shown in Fig. 2, the physical topology of all domains can be abstracted into a high-level virtual topology by simplifying each domain into a virtual node and connecting the virtual nodes by inter-domain links. The high-level agent also interacts with this high-level virtual EON environment, which carries the abstracted information from all domains, and makes the RMSA action in the virtual topology.

The configuration of inter-domain service requests is as follows. Upon the arrival of each service request, the high-level agent gets the current high-level state from the virtual topology and from the low-level agents. It then emits a high-level action, i.e., it selects a sequence consisting of virtual nodes and inter-domain links from pre-defined candidate sequences. It then coordinates with the low-level agents to determine the ingress/egress nodes of the inter-domain links in each related domain. These ingress/egress nodes are defined as the source/destination nodes for the subsequent intra-domain RMSA. Then each low-level agent gets the current low-level state within its domain and emits a low-level action, i.e., the RMSA decision in the corresponding domain. Finally, the whole inter-domain lightpath can be set up accordingly for the request. Based on the set-up results, the rewards for the low-level agents and the high-level agent are designed. Note that, instead of making their decisions independently, the low-level agents actually cooperate with the high-level agent to improve the performance of inter-domain service provisioning.

For example, for the MDEON shown in Fig. 3, the physical topology can be abstracted into a high-level virtual topology. Assume there is an inter-domain service request of 50 Gbps from the source node v_3^1 in Domain 1 to the destination node v_3^2 in Domain 2. First of all, the high-level agent gets the current high-level state and selects a proper sequence consisting of virtual nodes (domain i in the physical topology is denoted as D^i) and inter-domain links. For instance, the following sequence is chosen: D^1 → ẽ_{6,1}^{1,3} → D^3 → ẽ_{4,3}^{3,2} → D^2. This sequence is sent to the low-level agents, and the ingress node or the egress node of each domain is obtained from the selected inter-domain links; therefore, the source-destination pair for each intra-domain RMSA can be determined. For Domain 1, the egress node is v_6^1, and the source-destination pair is (v_3^1, v_6^1). For Domain 3, the ingress and egress nodes are v_1^3 and v_4^3, so the source-destination pair is (v_1^3, v_4^3). For Domain 2, the ingress node is v_3^2. Since this node is the same as the destination node, no intra-domain RMSA is needed for this domain. With the source-destination pairs given, the candidate paths for the intra-domain RMSA in each domain can be calculated, together with the bandwidth of the request. Then the low-level agents execute the intra-domain RMSA schemes separately.
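To make the cooperation concrete, the following Python sketch outlines the provisioning loop for one inter-domain request as described above. It is a simplified illustration rather than the paper's implementation: the object interfaces (observe, select_virtual_path, select_path_and_spectrum, has_free_slots, etc.) are hypothetical helper names, and spectrum bookkeeping and error handling are omitted.

```python
def provision_inter_domain_request(request, high_agent, low_agents, virtual_topology):
    """Hierarchical provisioning of one inter-domain request (simplified sketch)."""
    # 1. High-level agent observes the virtual topology plus abstracted
    #    per-domain information reported by the low-level agents.
    state_h = high_agent.observe(request, virtual_topology, low_agents)

    # 2. High-level action: pick one candidate sequence of virtual nodes
    #    (domains) and inter-domain links for this request.
    virtual_path = high_agent.select_virtual_path(state_h)

    # 3. The chosen inter-domain links fix the ingress/egress node of each
    #    traversed domain, i.e., the source-destination pair for the
    #    intra-domain RMSA in that domain.
    segments = []
    for domain, (src_node, dst_node) in virtual_path.domain_endpoints():
        if src_node == dst_node:          # e.g., Domain 2 in the Fig. 3 example
            continue                      # no intra-domain RMSA needed
        low_state = low_agents[domain].observe(request, src_node, dst_node)
        rmsa = low_agents[domain].select_path_and_spectrum(low_state)
        segments.append((domain, rmsa))

    # 4. The request is accepted only if every domain segment and every
    #    inter-domain link can be set up; the outcome is fed back as reward.
    accepted = all(rmsa.feasible for _, rmsa in segments) and \
               all(link.has_free_slots(request.bandwidth) for link in virtual_path.links)
    high_agent.receive_reward(accepted, virtual_path)
    for domain, rmsa in segments:
        low_agents[domain].receive_reward(accepted, rmsa)
    return accepted
```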
B. High-Level DRL Module

As explained in the previous section, the high-level agent needs to select a virtual path for each inter-domain service request. Hence, its action not only affects the choice of the virtual path but also affects the service provisioning in each chosen domain. To this end, we design the state of the high-level agent to include some connectivity information about the chosen domains, and we formulate the reward functions to consider the feedback from the chosen domains. The high-level DRL module is detailed as follows.

1) High-Level State s^H: The high-level state s^H is a vector representing the high-level environment information, which includes the information of the incoming service request and also the current network status. For the incoming service request, knowing the source domain and the destination domain in the virtual topology, K^H candidate virtual paths can be obtained. As mentioned in Section IV-A, each path is represented by a sequence of virtual nodes and inter-domain links. For the example shown in Fig. 3, the source domain is D^1 and the destination domain is D^2. Then, we can obtain 6 candidate virtual paths, sorted in ascending order of the summation of inter-domain link distances:
• virtual path 1: D^1 → ẽ_{2,1}^{1,2} → D^2
• virtual path 2: D^1 → ẽ_{6,2}^{1,2} → D^2
• virtual path 3: D^1 → ẽ_{6,1}^{1,3} → D^3 → ẽ_{4,3}^{3,2} → D^2
• virtual path 4: D^1 → ẽ_{6,1}^{1,3} → D^3 → ẽ_{1,4}^{3,2} → D^2
• virtual path 5: D^1 → ẽ_{5,3}^{1,3} → D^3 → ẽ_{4,3}^{3,2} → D^2
• virtual path 6: D^1 → ẽ_{5,3}^{1,3} → D^3 → ẽ_{1,4}^{3,2} → D^2

For each virtual path k, k ∈ {1, 2, ..., K^H}, we define a feasible probability P_k, which will be detailed later. The high-level state is defined by

s^H = [E_src, E_dst, B, P_1, P_2, ..., P_{K^H}].   (2)

In the above definition, E_src and E_dst are one-hot vectors representing the source domain and the destination domain of the incoming service request. Each element in E_src/E_dst represents a node in the virtual topology, where the source/destination node is set to 1 and the other nodes are 0. B denotes the bandwidth demand in Gbps. P_k, k ∈ {1, 2, ..., K^H}, is the feasible probability of virtual path k.

Recall that each virtual path is represented by a sequence of virtual nodes and inter-domain links. The feasible probability of the virtual path is defined as the product of the feasible probabilities of the related virtual nodes and inter-domain links. For example, for virtual path 3, D^1 → ẽ_{6,1}^{1,3} → D^3 → ẽ_{4,3}^{3,2} → D^2, the feasible probability is calculated by

P_3 = P(D^1 | in = v_3^1, out = v_6^1) × P_{ẽ_{6,1}^{1,3}} × P(D^3 | in = v_1^3, out = v_4^3) × P_{ẽ_{4,3}^{3,2}} × P(D^2 | in = v_3^2, out = v_3^2).   (3)

In the above equation, for each inter-domain link, if it satisfies the bandwidth demand of the inter-domain service request, its feasible probability is set to 1; otherwise, it is set to 0. Recall that O/E/O conversions are applied at the border nodes, so for the inter-domain links only the spectrum contiguity constraint needs to be satisfied during the spectrum allocation.

The feasible probability of the related virtual node D^i is defined as

P(D^i | in = v_m^i, out = v_n^i) = K_{m,n}^{i,feasible} / K_{m,n}^i,   (4)

where v_m^i is the ingress node or source node in Domain i, v_n^i is the egress node or destination node in Domain i, K_{m,n}^i represents the number of candidate intra-domain paths connecting v_m^i and v_n^i, and K_{m,n}^{i,feasible} is the number of candidate paths satisfying the service request. If v_m^i and v_n^i are the same node, i.e., m = n, the feasible probability is set to one. This definition reflects the congestion level between the ingress node and the egress node in the physical topology. Take Domain 1 in the above-mentioned virtual path as an example: its feasible probability is P(D^1 | in = v_3^1, out = v_6^1). Recall that no O/E/O conversions are applied for the intra-domain transmissions; therefore, the spectrum contiguity and continuity constraints should both be satisfied during the spectrum allocation.

2) High-Level Action a^H: As mentioned in Section IV-B1, for the incoming service request, with the source domain and the destination domain given in the virtual topology, K^H candidate virtual paths can be determined. A high-level action a^H is to choose one virtual path from the K^H candidate paths. Then, for each inter-domain link on the virtual path, the modulation format can be determined by the link distance according to Table I. Next, with the modulation format chosen, the number of FSs needed for each inter-domain link can be calculated by (1). Finally, for each inter-domain link, whether it can support the service request with enough available FSs should be checked based on the spectrum contiguity constraint. Besides, the intra-domain transmission will be handled by the low-level agents in all related domains, constrained by the spectrum continuity and contiguity constraints (detailed in Section IV-C). If the whole virtual path can be successfully set up, the service request is accepted. Otherwise, it will be blocked.

TABLE II: HIGH-LEVEL REWARD

3) High-Level Reward r^H: The high-level reward r^H is a scalar returned to the high-level agent, which evaluates the instantaneous benefit of the action a^H taken. The objective of inter-domain service provisioning is to minimize the blocking probability of these service requests. Therefore, the reward for the high-level agent r^H is relevant to whether the high-level agent chooses an appropriate virtual path and whether the related low-level agents choose proper RMSA schemes in the corresponding domains. Generally, if the inter-domain service request can be successfully set up, the high-level agent receives a positive reward; otherwise, it receives a negative reward.

The detailed reward setting is given in Table II. If the incoming service request is successfully accepted, we consider three cases. (1) The feasible probability of the chosen virtual path is the largest among all virtual paths, i.e., P_{a^H} = max_k P_k; the reward is set to a large positive value. (2) The feasible probability of the chosen virtual path is the smallest among all virtual paths, i.e., P_{a^H} = min_k P_k; the reward is set to a small positive value. (3) The feasible probability of the chosen virtual path is moderate among all virtual paths, i.e., P_{a^H} ≠ max_k P_k and P_{a^H} ≠ min_k P_k; the reward is set to a moderate positive value. If the incoming service request is blocked, we consider two cases. (1) The feasible probability of the chosen virtual path is the smallest among all virtual paths, i.e., P_{a^H} = min_k P_k; the reward is set to a large negative value. (2) The feasible probability of the chosen virtual path is not the smallest among all virtual paths; the reward is set to a small negative value.
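As an illustration of how (2)–(4) are assembled for one request, the following Python sketch computes the feasible probabilities and the high-level state vector. It is a minimal rendering under stated assumptions: path_is_feasible and link_has_contiguous_slots are hypothetical helpers standing in for the continuity/contiguity checks described above, and the request/topology objects are placeholders.

```python
import numpy as np

def node_feasible_probability(domain, ingress, egress, request):
    """Eq. (4): fraction of the K^i candidate intra-domain paths between the
    ingress and egress node that can still accommodate the request."""
    if ingress == egress:                      # m = n: no intra-domain hop needed
        return 1.0
    paths = domain.candidate_paths(ingress, egress)            # K^i_{m,n} paths
    feasible = sum(path_is_feasible(domain, p, request) for p in paths)
    return feasible / len(paths)

def virtual_path_feasible_probability(vpath, request):
    """Eq. (3): product over all traversed virtual nodes and inter-domain links."""
    prob = 1.0
    for domain, ingress, egress in vpath.domain_segments():
        prob *= node_feasible_probability(domain, ingress, egress, request)
    for link in vpath.inter_domain_links():
        # inter-domain links only need contiguous free slots (O/E/O at borders)
        prob *= 1.0 if link_has_contiguous_slots(link, request) else 0.0
    return prob

def high_level_state(request, candidate_vpaths, num_virtual_nodes):
    """Eq. (2): s^H = [E_src, E_dst, B, P_1, ..., P_{K^H}]."""
    e_src = np.zeros(num_virtual_nodes); e_src[request.src_domain] = 1.0
    e_dst = np.zeros(num_virtual_nodes); e_dst[request.dst_domain] = 1.0
    probs = [virtual_path_feasible_probability(v, request) for v in candidate_vpaths]
    return np.concatenate([e_src, e_dst, [request.bandwidth_gbps], probs])
```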
C. Low-Level DRL Module

1) Low-Level State s^i: From the virtual path given by the high-level agent, the ingress node and egress node in this domain can be obtained as the source node and destination node for the intra-domain transmission. The source node and the destination node are denoted by E_src^i and E_dst^i, which are one-hot vectors. Each element in E_src^i/E_dst^i represents a node in Domain i, where the source/destination node is set to 1 and the other nodes are 0. For example, in Fig. 3, for Domain 1, the source node is v_3^1 and the destination node is v_6^1, so E_src^1 = [0, 0, 1, 0, 0, 0] and E_dst^1 = [0, 0, 0, 0, 0, 1].

With the source node and destination node determined in Domain i, the K^i shortest candidate paths can be calculated. For different domains, the number of candidate paths can be different. For each candidate path k = 1, 2, ..., K^i, we define F_k^i, which includes 7 features [19] of the spectrum availability information:
• N_b, the total number of FSs required by the service request, which is calculated from (1) and Table I;
• N_FS, the total number of available FSs;
• N_FSB, the total number of available FS-blocks;
• N′_FSB, the total number of available FS-blocks satisfying the service request;
• I_start, the starting index of the first available FS-block (first-fit rule related) satisfying the service request;
• S_first, the size of the first available FS-block (first-fit rule related) satisfying the service request;
• S_FSB, the average size of the available FS-blocks.

For the example in Fig. 4, the values of the spectrum availability information F_k^i, ∀k ∈ {1, 2, ..., K^i}, are given in Table III. In the table, when I_start = −1 and S_first = −1, it means the first available FS-block (first-fit rule related) cannot satisfy the service request. Finally, for K^i candidate paths, there are 7 × K^i features for the spectrum information.
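The per-path features F_k^i can be computed directly from the slot-availability bitmap of a candidate path. The sketch below is a minimal, self-contained illustration with our own variable names; we read I_start and S_first as referring to the first FS-block able to accommodate the request (one plausible reading of the first-fit-related description above), with −1 returned when no such block exists.

```python
import numpy as np

def fs_blocks(available):
    """Return (start_index, size) of every maximal run of available FSs."""
    blocks, start = [], None
    for idx, free in enumerate(list(available) + [False]):   # sentinel closes the last run
        if free and start is None:
            start = idx
        elif not free and start is not None:
            blocks.append((start, idx - start))
            start = None
    return blocks

def path_spectrum_features(available, n_b):
    """The 7 features F_k^i for one candidate path.

    available : boolean array, True where an FS is free on every link of the
                path (spectrum continuity already applied).
    n_b       : number of contiguous FSs required by the request, from (1) and Table I.
    """
    blocks = fs_blocks(np.asarray(available, dtype=bool))
    fitting = [b for b in blocks if b[1] >= n_b]              # FS-blocks able to fit the request
    n_fs = int(np.sum(available))                             # total available FSs
    n_fsb = len(blocks)                                       # total available FS-blocks
    n_fsb_fit = len(fitting)                                  # FS-blocks satisfying the request
    if fitting:
        i_start, s_first = fitting[0]                         # first-fit related features
    else:
        i_start, s_first = -1, -1                             # no block can satisfy the request
    s_avg = float(np.mean([s for _, s in blocks])) if blocks else 0.0
    return [n_b, n_fs, n_fsb, n_fsb_fit, i_start, s_first, s_avg]
```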
2) Low-Level Action a^i: In Domain i, the low-level action a^i is to select one path from the K^i candidate paths according to the source node and the destination node of the current domain. Next, the modulation format is determined by the path distance according to Table I. Then, with the modulation format chosen, the number of FSs needed can be calculated by (1). Finally, whether the selected path can support the service request with enough available FSs should be checked based on the spectrum continuity and contiguity constraints. For an intra-domain service request, if there are enough available FSs, the service request is accepted by allocating the spectrum with the first-fit rule. Otherwise, it will be blocked. For an inter-domain service request, if there are enough available FSs in all related domains and inter-domain links, the service request is accepted by allocating the spectrum with the first-fit rule. Otherwise, it will be blocked.

3) Low-Level Reward r^i: In Domain i, the low-level reward r^i is a scalar returned to the low-level agent i, which evaluates the instantaneous benefit of the action a^i taken. For actions handling inter-domain service requests, the detailed reward setting is given in Table IV. If the incoming service request is successfully accepted, the reward is set to a positive value. If the incoming service request is blocked, we consider the effect of the high-level action in three cases. (1) The feasible probability of the virtual path chosen by the high-level agent is the largest among all virtual paths, i.e., P_{a^H} = max_k P_k; the reward is set to a large negative value. (2) The feasible probability of the chosen virtual path is the smallest among all virtual paths, i.e., P_{a^H} = min_k P_k; the reward is set to a small negative value. (3) The feasible probability of the chosen virtual path is moderate among all virtual paths, i.e., P_{a^H} ≠ max_k P_k and P_{a^H} ≠ min_k P_k; the reward is set to a moderate negative value.

The reasons for the above reward setting in the case of blocking are as follows. The feasible probability of the virtual path reflects the congestion level of the virtual path. A larger feasible probability means that the connectivity between the ingress node and the egress node inside the relevant domains is better, and therefore it is easier for the low-level agent to choose an action that accepts the service. In this situation, if the service request is finally blocked, we give a large penalty to the low-level agent. On the contrary, a smaller feasible probability means that the connectivity between the ingress node and the egress node inside the relevant domains is worse, and therefore it is more difficult for the low-level agent to choose an action that accepts the service. In this situation, if the service request is finally blocked, we give a small penalty to the low-level agent.

Intra-domain service requests are handled only by the corresponding low-level agent. The state representation and actions are the same as those for the inter-domain service requests, as introduced in Sections IV-C1 and IV-C2. The reward setting is different since it does not need to consider the actions of the high-level agent. Therefore, the reward is simply set to 1 if the request is accepted and −1 if it is blocked.
1, . . . , t0 + T − 1} and stores them to an episode buffer, during
The A3C algorithm is used for the training of the high-level the interaction with its environment. These samples then are used
agent and the low-level agents. The high-level agent and the for the training. It computes the advantage function for each of
low-level agents collaborate during the training. For the sake the state-action pairs from the samples. Each advantage function
of readability, we first introduce the background of the A3C uses the longest possible n-step return from this episode. Specif-
algorithm which can be used for the training of one agent in ically, the first state uses its T -step return; the second state uses
general without collaboration. Then, we explain the detailed its (T − 1)-step return, and the last state uses its one-step return.
collaborations of the agents. After that, the gradients are accumulated. Notice that the value
1) Background (The A3C Algorithm): Each agent includes a network is updated with the objective to minimize the mean
number of local learners, interacting with separate environment squared error of the estimated value function V (st ; θv ) and the
copies. In addition, each agent has a global learner, which estimated target value function, i.e., the mean squared error of
updates its parameters according to the results learned by local the advantage function. Therefore the accumulated gradient is
learners. For one learner, it consists of an actor, which outputs given by (10). The policy network is updated with the objective
the action according to the input state, and a critic, which to maximize the average reward per time-step, with its gradient
measures the goodness of the input state. The actor and the critic given by (11), according to [36].
are parameterized by two neural networks, called the policy
t0 +T −1
network and the value network, respectively. For the global
learner, the parameters of the actor and the critic are denoted grad(θv ) = ∇θv A (st , at ; θ, θv )|T −(t−t0 ) 2 , (10)
as Θ and Θv . For one local learner, they are denoted as θ and θv . t=t0
For the local learner, their output are π(at |st ; θ) and V (st ; θv ), t0 +T −1
respectively. grad(θ) = ∇θ log π(at |st ; θ)
The value function V (st ) is defined to be the expected cumu- t=t0
lative discounted reward starting from state st ,
× A (st , at ; θ, θv )|T −(t−t0 ) + β∇θ H (π(st ; θ)).
(11)
H (π(st ; θ)) is the entropy of the policy distribution [36]. β
∞ controls the strength of the entropy regularization term.
V (st ) = E [Gt ] = E γ i rt+i , (6) These gradients are then transmitted to the global learner to
i=0 update its parameters. Specifically, the parameters for the value
network and the policy network are updated by

Θ_v ← Θ_v − η · grad(θ_v),   (12)

Θ ← Θ + α · grad(θ),   (13)

where η and α are the learning rates of the value network and the policy network, respectively.

TABLE V: NOTATIONS FOR THE PARAMETERS OF THE HIGH-LEVEL AGENT AND THE LOW-LEVEL AGENT i

Algorithm 1: Learning Procedures of One MDEON Environment Copy.
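The n-step targets and advantages in (8)–(9) can be obtained with a single backward pass over the T collected samples, after which the global learner mixes in the accumulated gradients according to (12)–(13). The NumPy sketch below illustrates this arithmetic; the network forward passes and the automatic differentiation that produce grad(θ_v) and grad(θ) are abstracted away, and the variable names are ours rather than the paper's.

```python
import numpy as np

def n_step_targets_and_advantages(rewards, values, bootstrap_value, gamma):
    """Eqs. (8)-(9): each of the T states uses the longest possible n-step return.

    rewards         : r_{t0}, ..., r_{t0+T-1}
    values          : V(s_{t0}; θ_v), ..., V(s_{t0+T-1}; θ_v)
    bootstrap_value : V(s_{t0+T}; θ_v), the value estimate after the last sample
    """
    T = len(rewards)
    targets = np.empty(T)
    running = bootstrap_value
    for t in reversed(range(T)):                 # backward pass builds the n-step returns
        running = rewards[t] + gamma * running   # Σ γ^i r_{t+i} + γ^n V(s_{t+n})
        targets[t] = running
    advantages = targets - np.asarray(values)    # Eq. (9)
    return targets, advantages

def global_update(global_params, grads, lr):
    """Eqs. (12)-(13): the global learner applies the accumulated local gradients.
    The value loss (squared advantage) is minimized, hence the minus sign; the
    policy objective is maximized, hence the plus sign."""
    theta_v, theta = global_params
    grad_v, grad_pi = grads
    eta, alpha = lr
    theta_v = theta_v - eta * grad_v             # Eq. (12)
    theta = theta + alpha * grad_pi              # Eq. (13)
    return theta_v, theta
```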
TABLE VI: COMMON PARAMETERS USED IN ALL SIMULATIONS

The policy network uses a softmax output and the value network a linear output, with all non-output layers activated by ReLU [36].

Notice that, at the beginning of the simulation, the entire elastic optical network is empty. In order to exclude the initial transient phase, we have a warm-up period of 3,000 service requests, and then calculate the blocking probability for every 1,000 requests.
1,000 requests.
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on September 29,2024 at 03:12:33 UTC from IEEE Xplore. Restrictions apply.
XU et al.: HIERARCHICAL REINFORCEMENT LEARNING IN MULTI-DOMAIN ELASTIC OPTICAL NETWORKS TO REALIZE JOINT RMSA 2285
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on September 29,2024 at 03:12:33 UTC from IEEE Xplore. Restrictions apply.
2286 JOURNAL OF LIGHTWAVE TECHNOLOGY, VOL. 41, NO. 8, APRIL 15, 2023
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on September 29,2024 at 03:12:33 UTC from IEEE Xplore. Restrictions apply.
XU et al.: HIERARCHICAL REINFORCEMENT LEARNING IN MULTI-DOMAIN ELASTIC OPTICAL NETWORKS TO REALIZE JOINT RMSA 2287
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on September 29,2024 at 03:12:33 UTC from IEEE Xplore. Restrictions apply.
2288 JOURNAL OF LIGHTWAVE TECHNOLOGY, VOL. 41, NO. 8, APRIL 15, 2023
REFERENCES

[5] L. Velasco, L. Gifre, and A. Castro, "Brokered orchestration for end-to-end service provisioning across heterogeneous multi-operator networks," in Proc. 19th Int. Conf. Transparent Opt. Netw., 2017, pp. 1–4.
[6] G. Liu et al., "Hierarchical learning for cognitive end-to-end service provisioning in multi-domain autonomous optical networks," J. Lightw. Technol., vol. 37, no. 1, pp. 218–225, Jan. 2019.
[7] Z. Zhu et al., "OpenFlow-assisted online defragmentation in single-/multi-domain software-defined elastic optical networks," J. Opt. Commun. Netw., vol. 7, no. 1, pp. A7–A15, 2015.
[8] L. Sun, X. Chen, and Z. Zhu, "Multibroker-based service provisioning in multidomain SD-EONs: Why and how should the brokers cooperate with each other," J. Lightw. Technol., vol. 35, no. 17, pp. 3722–3733, Sep. 2017.
[9] Y. Zong, C. Feng, Y. Guan, Y. Liu, and L. Guo, "Virtual network embedding for multi-domain heterogeneous converged optical networks: Issues and challenges," Sensors, vol. 20, no. 9, 2020, Art. no. 2655.
[10] X. Chen, R. Proietti, H. Lu, A. Castro, and S. J. B. Yoo, "Knowledge-based autonomous service provisioning in multi-domain elastic optical networks," IEEE Commun. Mag., vol. 56, no. 8, pp. 152–158, Aug. 2018.
[11] K. Christodoulopoulos, I. Tomkos, and E. A. Varvarigos, "Elastic bandwidth allocation in flexible OFDM-based optical networks," J. Lightw. Technol., vol. 29, no. 9, pp. 1354–1366, May 2011.
[12] H. A. Dinarte, B. V. Correia, D. A. Chaves, and R. C. Almeida, "Routing and spectrum assignment: A metaheuristic for hybrid ordering selection in elastic optical networks," Comput. Netw., vol. 197, 2021, Art. no. 108287.
[13] B. C. Chatterjee, N. Sarma, and E. Oki, "Routing and spectrum allocation in elastic optical networks: A tutorial," IEEE Commun. Surveys Tuts., vol. 17, no. 3, pp. 1776–1800, Jul.–Sep. 2015.
[14] A. Bocoi, M. Schuster, F. Rambach, M. Kiese, C.-A. Bunge, and B. Spinnler, "Reach-dependent capacity in optical networks enabled by OFDM," in Proc. Conf. Opt. Fiber Commun., 2009, pp. 1–3.
[15] E. E. Moghaddam, H. Beyranvand, and J. A. Salehi, "Routing, spectrum and modulation level assignment, and scheduling in survivable elastic optical networks supporting multi-class traffic," J. Lightw. Technol., vol. 36, no. 23, pp. 5451–5461, Dec. 2018.
[16] X. Chen, B. Li, R. Proietti, H. Lu, Z. Zhu, and S. J. B. Yoo, "DeepRMSA: A deep reinforcement learning framework for routing, modulation and spectrum assignment in elastic optical networks," J. Lightw. Technol., vol. 37, no. 16, pp. 4155–4163, Aug. 2019.
[17] Y.-C. Huang, J. Zhang, and S. Yu, "Self-learning routing for optical networks," in Proc. Int. IFIP Conf. Opt. Netw. Des. Model., 2019, pp. 467–478.
[18] B. Tang, J. Chen, Y.-C. Huang, Y. Xue, and W. Zhou, "Optical network routing by deep reinforcement learning and knowledge distillation," in Proc. Asia Commun. Photon. Conf., 2021, pp. 1–3.
[19] L. Xu, Y.-C. Huang, Y. Xue, and X. Hu, "Deep reinforcement learning-based routing and spectrum assignment of EONs by exploiting GCN and RNN for feature extraction," J. Lightw. Technol., vol. 40, no. 15, pp. 4945–4955, Aug. 2022.
[20] B. Tang, Y.-C. Huang, Y. Xue, and W. Zhou, "Deep reinforcement learning-based RMSA policy distillation for elastic optical networks," Mathematics, vol. 10, no. 18, 2022, Art. no. 3293.
[21] N. E. D. E. Sheikh, E. Paz, J. Pinto, and A. Beghelli, "Multi-band provisioning in dynamic elastic optical networks: A comparative study of a heuristic and a deep reinforcement learning approach," in Proc. Int. Conf. Opt. Netw. Des. Model., 2021, pp. 1–3.
[22] B. Tang, Y.-C. Huang, Y. Xue, and W. Zhou, "Heuristic reward design for deep reinforcement learning-based routing, modulation and spectrum assignment of elastic optical networks," IEEE Commun. Lett., vol. 26, no. 11, pp. 2675–2679, Nov. 2022.
[23] Y. Hirota, H. Tode, and K. Murakami, "Multi-fiber based dynamic spectrum resource allocation for multi-domain elastic optical networks," in Proc. 18th Optoelectron. Commun. Conf. Held Jointly Int. Conf. Photon. Switching, 2013, pp. 1–2.
[24] S. Li, B. Li, and Z. Zhu, "To cooperate or not to cooperate: Service function chaining in multi-domain edge-cloud elastic optical networks," in Proc. Conf. Opt. Fiber Commun., 2022, pp. 1–3.
[25] B. Li, R. Zhang, X. Tian, and Z. Zhu, "Multi-agent and cooperative deep reinforcement learning for scalable network automation in multi-domain SD-EONs," IEEE Trans. Netw. Serv. Manag., vol. 18, no. 4, pp. 4801–4813, Dec. 2021.
[26] Z. Zhong, N. Hua, Z. Yuan, Y. Li, and X. Zheng, "Routing without routing algorithms: An AI-based routing paradigm for multi-domain optical networks," in Proc. Conf. Opt. Fiber Commun., 2019, pp. 1–3.
[27] R. Casellas et al., "A control plane architecture for multi-domain elastic optical networks: The view of the IDEALIST project," IEEE Commun. Mag., vol. 54, no. 8, pp. 136–143, Aug. 2016.
[28] C. Chen et al., "Demonstration of OpenFlow-controlled cooperative resource allocation in a multi-domain SD-EON testbed across multiple nations," in Proc. Eur. Conf. Opt. Commun., 2014, pp. 1–3.
[29] X. Chen, C. Chen, D. Hu, S. Ma, and Z. Zhu, "Multi-domain fragmentation-aware RSA operations through cooperative hierarchical controllers in SD-EONs," in Proc. Asia Commun. Photon. Conf., 2015, pp. 1–3.
[30] J. Zhu, X. Chen, D. Chen, S. Zhu, and Z. Zhu, "Service provisioning with energy-aware regenerator allocation in multi-domain EONs," in Proc. IEEE Glob. Commun. Conf., 2015, pp. 1–6.
[31] H.-C. Le, D. T. Hai, and N. T. Dang, "Development of dynamic QoT-aware lightpath provisioning scheme with flexible advanced reservation for distributed multi-domain elastic optical networks," in Proc. 10th Int. Symp. Inf. Commun. Technol., 2019, pp. 210–215.
[32] D. Batham and D. S. Yadav, "HPDST: Holding pathlength domain scheduled traffic strategy for multi-domain optical networks," Optik, vol. 222, 2020, Art. no. 165145.
[33] X. Chen, B. Li, R. Proietti, C.-Y. Liu, Z. Zhu, and S. J. B. Yoo, "Demonstration of distributed collaborative learning with end-to-end QoT estimation in multi-domain elastic optical networks," Opt. Exp., vol. 27, no. 24, pp. 35700–35709, 2019.
[34] D. Li, H. Fang, X. Zhang, J. Qi, and Z. Zhu, "DeepMDR: A deep-learning-assisted control plane system for scalable, protocol-independent, and multi-domain network automation," IEEE Commun. Mag., vol. 59, no. 3, pp. 62–68, Mar. 2021.
[35] W. Lu and Z. Zhu, "Dynamic service provisioning of advance reservation requests in elastic optical networks," J. Lightw. Technol., vol. 31, no. 10, pp. 1621–1627, May 2013.
[36] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[37] J. Li, D. Wang, M. Zhang, and S. Cui, "Digital twin-enabled self-evolved optical transceiver using deep reinforcement learning," Opt. Lett., vol. 45, no. 16, pp. 4654–4657, 2020.
[38] S. Cui, D. Wang, J. Li, and M. Zhang, "Dynamic programmable optical transceiver configuration based on digital twin," IEEE Commun. Lett., vol. 25, no. 1, pp. 205–208, Jan. 2021.
[39] D. Wang et al., "The role of digital twin in optical communication: Fault management, hardware configuration, and transmission simulation," IEEE Commun. Mag., vol. 59, no. 1, pp. 133–139, 2021.