

Hierarchical Reinforcement Learning in Multi-Domain Elastic Optical Networks to Realize Joint RMSA

Liufei Xu, Yue-Cai Huang, Member, IEEE, Yun Xue, and Xiaohui Hu

Abstract—To improve the network scalability, a large elastic optical network is typically segmented into multiple autonomous domains, where each domain possesses high autonomy and privacy. This architecture is referred to as the multi-domain elastic optical network (MDEON). In the MDEON, the routing, modulation, and spectrum allocation (RMSA) for the inter-domain service requests are challenging. As a result, deep reinforcement learning (DRL) has been introduced recently, where the RMSA policies are learned during the interaction of the DRL agents with the MDEON environment. Due to the autonomy of each domain in the MDEON, joint RMSA is essential to improve the overall performance. To realize the joint RMSA, we propose a hierarchical reinforcement learning (HRL) framework which consists of a high-level DRL module and multiple low-level DRL modules (one for each domain), with the collaboration of DRL modules. More specifically, for the inter-domain service requests, the high-level module obtains some abstracted information from the low-level DRL modules and generates the inter-domain RMSA decision for the low-level modules. Then the low-level DRL module gives the intra-domain RMSA decision and feeds back to the high-level module. The proposed HRL framework preserves the autonomy of each single domain while delivering effective overall network performance through the cooperation of the high-level and low-level DRL modules. Simulation results demonstrate that our proposed method outperforms previous approaches.

Index Terms—Routing, modulation and spectrum allocation, multi-domain elastic optical network, hierarchical reinforcement learning.

Manuscript received 11 October 2022; revised 14 December 2022; accepted 27 December 2022. Date of publication 9 January 2023; date of current version 17 April 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 62006084 and in part by Guangdong Basic and Applied Basic Research Foundation under Grant 2020A1515111110. (Corresponding author: Yue-Cai Huang.) Liufei Xu is with the School of Physics and Telecommunication Engineering, South China Normal University, Guangzhou 510006, China (e-mail: [email protected]). Yue-Cai Huang, Yun Xue, and Xiaohui Hu are with the School of Electronics and Information Engineering, South China Normal University, Foshan 528200, China (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more figures in this article are available at https://doi.org/10.1109/JLT.2023.3235039. Digital Object Identifier 10.1109/JLT.2023.3235039

I. INTRODUCTION

THE rising demand of new network services, such as cloud computing, video conferencing and live broadcast, poses great challenges to the bearing communication network [1]. Compared with the conventional wavelength division multiplexing (WDM) networks, elastic optical networks (EONs) possess high flexibility and efficiency in spectrum utilization and support larger capacity transmissions. Therefore, EON is considered to be a competitive candidate for the next-generation backbone infrastructure [2], [3], [4]. In practical situations, communication nodes in the network are geographically distributed and/or managed by different network operators [5]. In order to improve the network scalability, a large optical network is typically segmented into multiple autonomous domains, where each domain possesses high autonomy and privacy. It is generally referred to as a multi-domain elastic optical network (MDEON) [6], [7], [8], [9], [10].

The network resource allocation in optical networks can be described as the routing, modulation and spectrum allocation (RMSA) problem. Even for the single-domain EON, the RMSA problem is known to be NP-hard [11]. As a result, conventional studies have focused on the design of various heuristic algorithms, aiming to improve the overall network performance [12], [13]. Classical algorithms include the K-shortest path routing [13], the distance-aware modulation [14], and the first-fit spectrum allocation [15]. These heuristic solutions are based on handcrafted rules that are supposed to lead to good performance. However, the relationship between the rules and the final performance is indirect, and therefore, it is difficult to find the optimal solution, especially in the complicated EON environment. In recent years, deep reinforcement learning (DRL) has been applied in dynamic elastic optical networks to tackle the RMSA problem [16], [17], [18], [19]. Different from the heuristic solutions, the DRL-based approaches learn directly from the feedback of the network performance. As a result, DRL has been recognized as a promising method to approach the optimal resource allocation strategy in the complicated EON environment. The performance of DRL-based approaches has been demonstrated to be better than classical heuristic algorithms (e.g., K-shortest path routing and first-fit spectrum allocation), and various studies have been done to explore their advantages [20], [21], [22]. By far, most DRL-based RMSA studies have been applied to single-domain scenarios, where a controller is aware of the whole network conditions.

As mentioned earlier, MDEON should be considered in practical situations [23]. In the MDEON, to ensure the autonomy of each domain, the domain controller does not disclose detailed intra-domain information to other domains. So the coordination


between the domain controllers is challenging for the inter-domain service requests [8], [24]. Currently, using DRL to solve the service request configuration problem of the MDEON has begun to arouse researchers' interest [25], [26]. Li et al. [25] proposed DeepCoop, where each domain is autonomously managed by one DRL agent. For inter-domain service requests, DeepCoop first uses a domain-level path computation element (PCE) to obtain the sequence of the domains that the request goes through, and then the agents take charge of the intra-domain resource provision. During the provision, adjacent agents exchange information and cooperate with each other. This work extends DRL-based RMSA to multi-domain cases. However, the domain-level path computation element (PCE) is still based on heuristic rules, and the cooperation is confined between adjacent domains. To further exploit the advantages of DRL in the MDEON, we propose here a hierarchical reinforcement learning framework for the multi-domain elastic optical network to realize joint RMSA.

The proposed hierarchical reinforcement learning (HRL) framework consists of a high-level DRL module and multiple low-level DRL modules, one for each domain. Low-level modules take charge of intra-domain service requests in corresponding domains. The high-level module and low-level modules cooperatively handle the inter-domain service requests. Specifically, the high-level module does not have detailed information of each domain, but it obtains some abstracted information from the low-level DRL modules. Based on this limited information, the high-level module determines a sequence of domains that the inter-domain service request goes through. This sequence is given to the low-level DRL modules. Low-level DRL modules determine the intra-domain resource allocation strategies and return the provision results to the high-level DRL module. The proposed HRL framework preserves the autonomy of each single domain while delivering effective overall network performance through the cooperation of the high-level and low-level DRL modules.

We have conducted extensive simulations on the MDEON to evaluate our proposal, and its effectiveness is demonstrated in the following three aspects. (1) The proposed approach outperforms classical heuristic algorithms and state-of-the-art DRL-based approaches. (2) Cooperation between the high-level and low-level modules improves the overall performance. (3) The proposed approach possesses good universality in different network topologies and different traffic loads.

The rest of this paper is organized as follows. Section II surveys the related work. Section III describes the problem formulation of the RMSA problem in the MDEON. The architecture of our proposed approach is presented in Section IV. Section V compares the performance of our proposed approach with other state-of-the-art methods, and finally, Section VI concludes this paper.

II. RELATED WORK

A. Heuristic RMSA Strategies

In the past decade, research has been done on the routing, modulation and spectrum allocation in MDEONs. Most of them are based on a hierarchical control architecture. Casellas et al. [27] introduced the IDEALIST project, which includes the design and implementation of a hierarchical control plane for the multi-domain EON, while the detailed RMSA solutions are not exploited. Based on the hierarchical control architecture, various heuristic RMSA strategies have been proposed. Chen et al. [28] demonstrated a multi-domain EON testbed where the RMSA is based on the conventional shortest path routing or random fit. Hirota et al. [23] defined a score for multiple routes based on the number of available frequency slots and the number of fibers and then chose the route with the maximum score. Chen et al. [29] introduced a spectral cut to measure the level of spectrum fragmentation and proposed a fragmentation-aware RSA by choosing the path with the minimum spectral cut. Zhu et al. [30] assigned weights to measure the spectrum usage and power consumption for the RMSA decisions and tried to find the one with the minimum weight. Le et al. [31] checked the routes following the order of breadth-first search until the QoT and fragmentation requirements are satisfied. Batham et al. [32] defined a so-called holding pathlength domain factor (HPDF) based on three parameters: the holding time of the request, the pathlength of the route and the number of domains across the route. It served the traffic requests in the ascending order of HPDF. In general, the idea of the above heuristic RMSA strategies is to first define some metrics and then optimize the problem based on these metrics. However, the relationship between these user-defined metrics and the final performance is indirect, and therefore, it is difficult to find the optimal solution, especially in the complicated MDEON environment.

B. Machine Learning-Based RMSA

Machine learning has been exploited in the RMSA of the MDEON recently. Chen et al. [10], [33] proposed a distributed collaborative learning approach for the MDEON. It calculated K shortest paths and selected the QoT-guaranteed one by jointly minimizing the current resource utilization and the predicted traffic on all traversed inter-domain links. The traffic prediction and QoT estimation were done by machine learning. Similar to these heuristic methods, the relationship between the interested metrics such as the QoT and the final performance is indirect. As a result, the optimization was done in an indirect way. Li et al. [34] introduced DeepMDR, which is a deep learning-assisted control plane system to realize scalable path computation in multi-domain packet networks. It first collected training samples whose routes are obtained based on a genetic algorithm and then used these samples to train a neural network to generate routes for new requests. Zhong et al. [26] proposed a control plane architecture that enables AI-based routing. It learns from sparse historical route trajectories and trains a deep-learning model that can directly return a feasible inter-domain route when being requested. In [26], [34], RMSA decisions are given by neural networks, while the training samples are obtained from user-generated data or historical data. With such supervised learning, the learned model can only fit the given data, instead of optimizing the performance.

Different from the above solutions, the DRL-based approaches learn directly from the feedback of the network and improve the performance gradually via interactions with the network. Li et al. [25] designed DeepCoop to realize service provisioning in MDEONs with cooperative deep reinforcement learning agents. However, the domain-level RMSA is still rule-based, which limits the advantage of DRL in the MDEON. Therefore, we propose in this paper a hierarchical reinforcement learning framework for the multi-domain elastic optical network to realize joint RMSA.

III. PROBLEM FORMULATION

We consider a MDEON with N domains, modeled as a graph G = {G^i(V^i, E^i), ∀i ∈ [1, N]; Ẽ}. Here, G^i(V^i, E^i) represents the topology of Domain i, where V^i and E^i are the sets of nodes and fiber links in Domain i, and Ẽ is the set of inter-domain links. In Domain i, each fiber link can support N^i frequency slots (FSs). For the inter-domain links, each of them can accommodate N^inter FSs. In Domain i, the link from node u ∈ V^i to node v ∈ V^i is denoted by e^{i,i}_{u,v}. The inter-domain link between Domains i and j can be denoted as ẽ^{i,j}_{u,v}, where u ∈ V^i and v ∈ V^j are the border nodes at its two ends. No wavelength conversion and O/E/O conversions are applied inside each domain, while O/E/O conversions are used on both sides of each inter-domain link. Therefore, the spectrum continuity and contiguity constraints should be satisfied inside each domain, while only the spectrum contiguity constraint needs to be satisfied for the inter-domain links.

Service requests randomly join the network. Each of them can be modeled by a tuple TR{src, dst, B}, where src denotes the source node, dst denotes the destination node, and B is the bandwidth demand in Gbps. Service requests in the MDEON can be classified into intra-domain service requests and inter-domain service requests. For intra-domain service requests, the pairs of the source node src and the destination node dst are randomly generated from the nodes in the same domain, i.e., {src ∈ V^i, dst ∈ V^i, ∀i ∈ [1, N], src ≠ dst}. For inter-domain service requests, the source node src and the destination node dst are randomly generated from different domains, i.e., {src ∈ V^i, dst ∈ V^j, ∀i, j ∈ [1, N], i ≠ j}.

Upon receiving an inter-domain service request, an RMSA scheme is needed, which first chooses a path (i.e., a link sequence) that connects the source and the destination. This path consists of intra-domain paths and inter-domain links. Due to the introduction of O/E/O conversions between domains, different modulation formats and spectra can be used in different domains and for different inter-domain links. For each domain or inter-domain link, the modulation format selection and spectrum allocation follow the same rule. The path distance inside each related domain or each inter-domain link is first obtained. Then, the modulation format is determined [35] according to Table I. Here, we consider four modulation formats: BPSK, QPSK, 8-QAM and 16-QAM, whose bits per symbol, denoted by m, are 1, 2, 3, and 4, respectively. Finally, the number of required contiguous FSs is calculated by

N_{src,dst} = ⌈B / (m·W)⌉ + 1.    (1)

TABLE I: DISTANCE-AWARE MODULATION FORMATS

In (1), B is the bandwidth demand in Gbps, W is the data rate of each FS in Gbps, m is the number of bits per symbol for the chosen modulation format, and the 1 represents that one FS is used for the guard band. Upon receiving an intra-domain service request, we first choose a path inside this domain and calculate the number of required contiguous FSs in the same way. With the number of required contiguous FSs determined, the spectrum resources are allocated accordingly. If the allocation is successful, the service request is accepted. Otherwise, the service request is blocked.
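As a concrete reading of Table I and (1), the short Python sketch below selects a distance-aware modulation format and computes the required number of FSs. The reach thresholds and the FS data rate W = 12.5 Gbps are illustrative assumptions, since the numeric entries of Table I are not reproduced in this text.

```python
import math

# Illustrative reach thresholds (km) standing in for Table I; the exact
# values are assumptions, not taken from the paper.
MODULATION_TABLE = [
    (625, 4),             # 16-QAM: 4 bits/symbol
    (1250, 3),            # 8-QAM:  3 bits/symbol
    (2500, 2),            # QPSK:   2 bits/symbol
    (float("inf"), 1),    # BPSK:   1 bit/symbol
]

def bits_per_symbol(path_distance_km: float) -> int:
    """Pick the highest-order modulation format whose reach covers the path."""
    for max_reach, m in MODULATION_TABLE:
        if path_distance_km <= max_reach:
            return m
    return 1

def required_fs(bandwidth_gbps: float, path_distance_km: float,
                fs_rate_gbps: float = 12.5) -> int:
    """Eq. (1): N = ceil(B / (m * W)) + 1, the +1 being the guard-band slot."""
    m = bits_per_symbol(path_distance_km)
    return math.ceil(bandwidth_gbps / (m * fs_rate_gbps)) + 1

if __name__ == "__main__":
    # A 50 Gbps request over a 900 km intra-domain path.
    print(required_fs(50.0, 900.0))   # 8-QAM -> ceil(50 / 37.5) + 1 = 3
```

Under these assumptions, a 50 Gbps request over a 900 km path needs two data slots plus one guard-band slot.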
The RMSA policy is expected to minimize the overall blocking probability. As this is a challenging task for the EON, in this paper, the RMSA is conducted by deep reinforcement learning. Generally, DRL is carried out through the interactions of an agent and an environment. As illustrated in Fig. 1, the agent observes the environment through a state and it emits an action according to the state and its internal policy. The environment receives this action, moves to a new state and releases a reward. The agent receives the reward and adjusts its policy. The objective of the learning is to maximize the long-term cumulative reward.

Fig. 1. Overall operation principle of the DRL approach.

IV. THE FRAMEWORK OF THE HRL

The inter-domain RMSA problem is challenging in the MDEON. Generally, each domain is autonomously managed, so that it does not disclose detailed intra-domain information to other domains. Instead, we allow them to communicate with a global controller using limited abstracted information. The global controller coordinates with all domain controllers to solve the inter-domain RMSA problem. By doing this, we break the inter-domain RMSA problem into two mutually affected sub-problems: 1) finding a sequence of domains and inter-domain links that connects the source domain and the destination domain, and 2) choosing a feasible RMSA scheme for each domain in the above sequence. According to this design, we introduce a hierarchical reinforcement learning framework to tackle the inter-domain RMSA problem, where a high-level DRL module addresses sub-problem 1) and multiple low-level DRL modules address sub-problem 2). In this section, we first give an overview of the configuration for inter-domain service requests

with the HRL-based framework. Then we explain in detail the high-level DRL module, the low-level DRL modules, and their training processes.

A. Framework Overview

The overall architecture of the HRL is shown in Fig. 2, which utilizes the hierarchical control plane for the inter-domain service provisioning. Each domain has a low-level agent which interacts with its corresponding domain environment and makes intra-domain RMSA actions. All low-level agents communicate with a shared high-level agent. Meanwhile, as shown in Fig. 2, the physical topology of all domains can be abstracted to a high-level virtual topology, by simplifying each domain as a virtual node and connecting the virtual nodes by inter-domain links. The high-level agent also interacts with this high-level virtual EON environment, which carries the abstracted information from all domains, and makes the RMSA action in the virtual topology.

Fig. 2. Overall architecture of the hierarchical reinforcement learning framework.

The configuration of inter-domain service requests is as follows. Upon each service request coming, the high-level agent gets the current high-level state from the virtual topology and from the low-level agents. It then emits a high-level action, i.e., selecting a sequence consisting of virtual nodes and inter-domain links from pre-defined candidate sequences. It then coordinates with the low-level agents to determine the ingress/egress nodes of the inter-domain links in each related domain. These ingress/egress nodes are defined as the source/destination nodes for the subsequent intra-domain RMSA. Then each low-level agent gets the current low-level state within its domain and emits a low-level action, i.e., the RMSA decision in the corresponding domain. Finally, the whole inter-domain lightpath can be set up accordingly for the request. Based on the setup results, the reward for the low-level agents and the high-level agent is designed. Note that instead of making their decisions independently, the low-level agents actually cooperate with the high-level agent to improve the performance of inter-domain service provisioning.

Fig. 3. Example on constructing the virtual topology (fiber distance in kilometers).

For example, for the MDEON shown in Fig. 3, the physical topology can be abstracted to a high-level virtual topology. Assume there is an inter-domain service request with 50 Gbps from the source node v_3^1 in Domain 1 to the destination node v_3^2 in Domain 2. First of all, the high-level agent gets the current high-level state and selects a proper sequence consisting of virtual nodes (Domain i in the physical topology is denoted as D^i) and inter-domain links. For instance, the following sequence is chosen: D^1 → ẽ^{1,3}_{6,1} → D^3 → ẽ^{3,2}_{4,3} → D^2. This sequence is sent to the low-level agents, and the ingress node or the egress node of each domain is obtained from the selected inter-domain links; therefore, the source-destination pair for each intra-domain RMSA can be determined. For Domain 1, the egress node is v_6^1, and the source-destination pair is (v_3^1, v_6^1). For Domain 3, the ingress and egress nodes are v_1^3 and v_4^3, and then the source-destination pair is (v_1^3, v_4^3). For Domain 2, the ingress node is v_3^2. Since this node is the same as the destination node, no intra-domain RMSA is needed for this domain. With the source-destination pairs given, the candidate paths of the intra-domain RMSA scheme in each domain can be calculated, together with the bandwidth of the request. Then the low-level agents execute the intra-domain RMSA schemes separately.
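The unpacking of a chosen virtual path into per-domain source-destination pairs can be sketched as below. The data layout (alternating domain identifiers and inter-domain links carrying their border nodes) and the function name are illustrative assumptions that mirror the example above, not an interface defined in the paper.

```python
# Minimal sketch: derive the per-domain (source, destination) pairs that the
# low-level agents receive, given a virtual path chosen by the high-level agent.

def intra_domain_pairs(virtual_path, src_node, dst_node):
    """virtual_path alternates domain ids and inter-domain links, e.g.
    ['D1', ('D1', 6, 'D3', 1), 'D3', ('D3', 4, 'D2', 3), 'D2'].
    Each link tuple is (from_domain, egress_node, to_domain, ingress_node).
    Returns {domain: (source, destination)}; a domain whose source equals its
    destination needs no intra-domain RMSA."""
    pairs = {}
    entry = src_node                       # where the lightpath enters the domain
    for i in range(0, len(virtual_path) - 1, 2):
        domain, link = virtual_path[i], virtual_path[i + 1]
        _, egress, _, next_ingress = link
        pairs[domain] = (entry, egress)    # leave through the chosen border node
        entry = next_ingress               # O/E/O regeneration at the border
    pairs[virtual_path[-1]] = (entry, dst_node)
    return {d: p for d, p in pairs.items() if p[0] != p[1]}

if __name__ == "__main__":
    path = ['D1', ('D1', 6, 'D3', 1), 'D3', ('D3', 4, 'D2', 3), 'D2']
    print(intra_domain_pairs(path, src_node=3, dst_node=3))
    # {'D1': (3, 6), 'D3': (1, 4)}; Domain 2 drops out (ingress == destination)
```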
B. High-Level DRL Module

As we have explained in the previous section, the high-level agent needs to select a virtual path for each inter-domain service request. Hence, its action not only affects the choice of the virtual path but also affects the service provisioning in each chosen domain. To this end, we design the state of the high-level agent to include some connectivity information about the chosen domains and formulate the reward functions to consider the feedback from the chosen domains. The high-level DRL module is detailed as follows.

1) High-Level State s^H: The high-level state s^H is a vector representing the high-level environment information, which includes the information of the incoming service request and


also the current network status. For the incoming service request, knowing the source domain and the destination domain in the virtual topology, K^H candidate virtual paths can be obtained. As mentioned in Section IV-A, each path is represented by a sequence of virtual nodes and inter-domain links. For the example shown in Fig. 3, the source domain is D^1 and the destination domain is D^2. Then, we can obtain 6 candidate virtual paths sorted in ascending order by the summation of inter-domain link distance:
• virtual path 1: D^1 → ẽ^{1,2}_{2,1} → D^2
• virtual path 2: D^1 → ẽ^{1,2}_{6,2} → D^2
• virtual path 3: D^1 → ẽ^{1,3}_{6,1} → D^3 → ẽ^{3,2}_{4,3} → D^2
• virtual path 4: D^1 → ẽ^{1,3}_{6,1} → D^3 → ẽ^{3,2}_{1,4} → D^2
• virtual path 5: D^1 → ẽ^{1,3}_{5,3} → D^3 → ẽ^{3,2}_{4,3} → D^2
• virtual path 6: D^1 → ẽ^{1,3}_{5,3} → D^3 → ẽ^{3,2}_{1,4} → D^2

For each virtual path k, k ∈ {1, 2, ..., K^H}, we define a feasible probability P_k, which will be detailed later. The high-level state is defined by

s^H = [E_src, E_dst, B, P_1, P_2, ..., P_{K^H}].    (2)

In the above definition, E_src and E_dst are one-hot vectors representing the source domain and the destination domain of the incoming service request. Each element in E_src/E_dst represents a node in the virtual topology, where the source/destination node is given 1 and the other nodes are 0s. B denotes the bandwidth demand in Gbps. P_k, k ∈ {1, 2, ..., K^H}, is the feasible probability of virtual path k.
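A minimal sketch of assembling the state vector of (2) is given below; the helper names, the example probability values and the three-domain size are illustrative and simply follow the example of Fig. 3.

```python
# Sketch of the high-level state of eq. (2): one-hot source/destination
# domains, the bandwidth demand, and one feasible probability per candidate
# virtual path. Names and example values are illustrative.

def one_hot(index: int, size: int) -> list:
    v = [0.0] * size
    v[index] = 1.0
    return v

def high_level_state(src_domain: int, dst_domain: int, bandwidth_gbps: float,
                     feasible_probs: list, num_domains: int) -> list:
    """s_H = [E_src, E_dst, B, P_1, ..., P_{K_H}]."""
    return (one_hot(src_domain, num_domains)
            + one_hot(dst_domain, num_domains)
            + [bandwidth_gbps]
            + list(feasible_probs))

if __name__ == "__main__":
    # Example of Fig. 3: a 50 Gbps request from Domain 1 to Domain 2,
    # with six candidate virtual paths.
    s = high_level_state(0, 1, 50.0, [0.9, 0.7, 0.6, 0.5, 0.4, 0.3],
                         num_domains=3)
    print(len(s), s)   # 3 + 3 + 1 + 6 = 13 entries
```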
Recall that each virtual path is represented by a sequence of virtual nodes and inter-domain links. The feasible probability of the virtual path is defined as the product of the feasible probabilities of the related virtual nodes and inter-domain links. For example, for the virtual path 3: D^1 → ẽ^{1,3}_{6,1} → D^3 → ẽ^{3,2}_{4,3} → D^2, the feasible probability is calculated by

P_3 = P(D^1 | in = v_3^1, out = v_6^1) × P_{ẽ^{1,3}_{6,1}} × P(D^3 | in = v_1^3, out = v_4^3) × P_{ẽ^{3,2}_{4,3}} × P(D^2 | in = v_3^2, out = v_3^2).    (3)

In the above equation, for each inter-domain link, if it satisfies the bandwidth demand of the inter-domain service request, its feasible probability is set to be 1; otherwise, it is set to be 0. Recall that O/E/O conversions are applied on the border nodes, so for the inter-domain links, only the spectrum contiguity constraint needs to be satisfied during the spectrum allocation. The feasible probability of the related virtual node D^i is defined as

P(D^i | in = v_m^i, out = v_n^i) = K^{i,feasible}_{m,n} / K^i_{m,n},    (4)

where v_m^i is the ingress node or source node in Domain i, v_n^i is the egress node or the destination node in Domain i, K^i_{m,n} represents the number of candidate intra-domain paths connecting v_m^i and v_n^i, and K^{i,feasible}_{m,n} is the number of candidate paths satisfying the service request. If v_m^i and v_n^i are the same node, i.e., m = n, the feasible probability is set to be one. This definition reflects the congestion level between the ingress node and the egress node in the physical topology. Take Domain 1 in the above-mentioned virtual path as an example: its feasible probability is P(D^1 | in = v_3^1, out = v_6^1). Recall that no O/E/O conversions are applied for the intra-domain transmissions; therefore, the spectrum contiguity and continuity constraints should both be satisfied during the spectrum allocation.
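The sketch below mirrors (3) and (4): each traversed domain contributes the ratio of still-feasible candidate intra-domain paths, each inter-domain link contributes 1 or 0, and the product gives the virtual path's feasible probability. The path_is_feasible predicate stands in for the actual spectrum check and is an assumption of this illustration.

```python
# Sketch of the feasible probability of a virtual path, eqs. (3)-(4).

def domain_feasible_prob(candidate_paths, path_is_feasible, ingress, egress):
    """Eq. (4): K^{i,feasible}_{m,n} / K^i_{m,n}; equals 1 when ingress == egress."""
    if ingress == egress:
        return 1.0
    paths = candidate_paths[(ingress, egress)]
    feasible = sum(1 for p in paths if path_is_feasible(p))
    return feasible / len(paths)

def virtual_path_feasible_prob(domain_terms, link_terms):
    """Eq. (3): product of the per-domain and per-link feasible probabilities."""
    prob = 1.0
    for term in list(domain_terms) + list(link_terms):
        prob *= term
    return prob

if __name__ == "__main__":
    # Domain 1 of Fig. 3: three candidate paths from node 3 to node 6,
    # two of which still have enough contiguous-and-continuous spectrum.
    cand = {(3, 6): ["p1", "p2", "p3"]}
    ok = {"p1": True, "p2": False, "p3": True}
    p_d1 = domain_feasible_prob(cand, lambda p: ok[p], ingress=3, egress=6)
    print(virtual_path_feasible_prob([p_d1, 1.0, 1.0], [1.0, 1.0]))  # 0.666...
```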

2) High-Level Action a^H: As mentioned in Section IV-B1, for the incoming service request, with the source domain and the destination domain given in the virtual topology, K^H candidate virtual paths can be determined. A high-level action a^H is to choose one virtual path from the K^H candidate paths. Then, for each inter-domain link on the virtual path, the modulation format can be determined by the link distance according to Table I. Next, with the modulation format chosen, the number of FSs needed for each inter-domain link can be calculated by (1). Finally, for each inter-domain link, whether it can support the service request with enough available FSs should be checked based on the spectrum contiguity constraint. Besides, the intra-domain transmission will be handled by the low-level agents in all related domains, restrained by the spectrum continuity and contiguity constraints (detailed in Section IV-C). If the whole virtual path can be successfully set up, the service request is accepted. Otherwise, it will be blocked.

3) High-Level Reward r^H: The high-level reward r^H is a scalar returned to the high-level agent, which evaluates the instantaneous benefit of the action a^H taken. The objective of inter-domain service provisioning is to minimize the blocking probability of these service requests. Therefore, the reward for the high-level agent r^H is relevant to whether the high-level agent chooses an appropriate virtual path and whether the related low-level agents choose proper RMSA schemes in the corresponding domains. Generally, if the inter-domain service request can be successfully set up, the high-level agent will receive a positive reward; otherwise, it receives a negative reward.

TABLE II: HIGH-LEVEL REWARD

The detailed reward setting is given in Table II. If the incoming service request is successfully accepted, we consider three cases. (1) The feasible probability of the chosen virtual path is the largest among all virtual paths, i.e., P_{a^H} = max P_k: the reward is set to be a large positive value. (2) The feasible probability of the chosen virtual path is the smallest among all virtual paths, i.e., P_{a^H} = min P_k: the reward is set to be a small positive value. (3) The feasible probability of the chosen virtual path is moderate among all virtual paths, i.e., P_{a^H} ≠ max P_k and P_{a^H} ≠ min P_k: the reward is set to be a moderate positive value. If the incoming service request is blocked, we consider two cases. (1) The feasible probability of the chosen virtual path is the smallest among all virtual paths, i.e., P_{a^H} = min P_k: the reward is set to be a large negative value. (2) The feasible probability of the chosen virtual path is not the smallest among all virtual paths, i.e., P_{a^H} ≠ min P_k: the reward is set to be a not-large negative value. The reasons for the above reward setting are as follows. The feasible probability reflects the congestion level of the virtual path. A larger feasible probability means that the connectivity between the ingress node and the egress node inside the relevant domains is better, and therefore future requests are less likely to be blocked. As a result, we prefer the high-level agent to select the virtual path with a large feasible probability. If the virtual path with a larger feasible probability is chosen, we give a larger reward.
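A compact sketch of this discrete high-level reward is shown below. The numeric entries of Table II are not reproduced in this text, so the values 3.0, 1.5, 0.8, -0.8 and -3.0 are assumptions; 3.0 and 0.8 are chosen to be consistent with the end points implied by (14) and (15) in Section V-F.

```python
# Sketch of the discrete high-level reward described around Table II.
# The exact Table II values are not available here; the constants below
# are assumed placeholders that respect the ordering described in the text.

def high_level_reward(accepted: bool, p_chosen: float, all_probs: list) -> float:
    p_max, p_min = max(all_probs), min(all_probs)
    if accepted:
        if p_chosen == p_max:
            return 3.0      # best-connected virtual path chosen: large positive
        if p_chosen == p_min:
            return 0.8      # worst path chosen but still accepted: small positive
        return 1.5          # moderate positive
    # blocked
    if p_chosen == p_min:
        return -3.0         # worst path chosen and blocked: large negative
    return -0.8             # blocked despite a better-connected path: smaller penalty
```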
C. Low-Level DRL Module

For an incoming inter-domain service request, firstly, the high-level agent chooses a virtual path as introduced in Section IV-B. Then the related low-level agents select the appropriate RMSA schemes based on the ingress and egress nodes from the virtual path. Notice that the low-level agents independently collect the states of their domains and then individually make the RMSA schemes for the service request. In the following, we introduce the DRL module design for Domain i, which can be extended to all domains in a similar way.

1) Low-Level State s^i: In Domain i, the low-level state s^i is a vector representing the low-level environment status of the corresponding domain, defined as

s^i = [E^i_src, E^i_dst, F^i_1, F^i_2, ..., F^i_{K^i}].    (5)

From the virtual path given by the high-level agent, the ingress node and egress node in this domain can be obtained as the source node and destination node for the intra-domain transmission. The source node and the destination node are denoted by E^i_src and E^i_dst, which are one-hot vectors. Each element in E^i_src/E^i_dst represents a node in Domain i, where the source/destination node is given 1 and the other nodes are 0s. For example in Fig. 3, for Domain 1, the source node is v_3^1 and the destination node is v_6^1, so E^1_src = [0, 0, 1, 0, 0, 0] and E^1_dst = [0, 0, 0, 0, 0, 1].

With the source node and destination node determined in Domain i, the K^i shortest candidate paths can be calculated. For different domains, the number of candidate paths can be different. For each candidate path k = 1, 2, ..., K^i, we define F^i_k, which includes 7 features [19] of the spectrum availability information:
• N_b, the total number of required FSs by the service request, which is calculated from (1) and Table I;
• N_FS, the total number of available FSs;
• N_FSB, the total number of available FS-blocks;
• N'_FSB, the total number of available FS-blocks satisfying the service request;
• I_start, the starting index of the first available FS-block (first-fit rule related) satisfying the service request;
• S_first, the size of the first available FS-block (first-fit rule related) satisfying the service request;
• S_FSB, the average size of the available FS-blocks.

Fig. 4. Example to show three candidate paths from v_3^1 to v_6^1 with the usage of the frequency slots on the candidate paths.

TABLE III: THE VALUE OF THE SPECTRUM AVAILABILITY DISTRIBUTION FEATURES FOR THE EXAMPLE SHOWN IN FIG. 4

For the example in Fig. 4, the values of the spectrum availability information F^i_k, ∀k ∈ {1, 2, ..., K^i}, are given in Table III. In the table, when I_start = −1 and S_first = −1, it means the first available FS-block (first-fit rule related) cannot satisfy the service request. Finally, for K^i candidate paths, there are 7 × K^i features for the spectrum information.
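The seven features can be computed from a per-path slot-availability mask as sketched below; the boolean-mask representation and the dictionary keys are illustrative choices, not taken from the paper.

```python
# Sketch of the seven spectrum-availability features of Section IV-C1.
# A candidate path's spectrum is summarised as a boolean mask over the
# frequency slots (True = free on every link of the path); n_required
# comes from eq. (1).

def fs_blocks(mask):
    """Return (start_index, size) of every maximal run of free slots."""
    blocks, start = [], None
    for i, free in enumerate(list(mask) + [False]):   # sentinel closes last run
        if free and start is None:
            start = i
        elif not free and start is not None:
            blocks.append((start, i - start))
            start = None
    return blocks

def spectrum_features(mask, n_required):
    blocks = fs_blocks(mask)
    fitting = [b for b in blocks if b[1] >= n_required]
    i_start, s_first = (fitting[0] if fitting else (-1, -1))   # first-fit block
    total_free = sum(size for _, size in blocks)
    return {
        "N_b": n_required,                              # slots the request needs
        "N_FS": total_free,                             # free slots in total
        "N_FSB": len(blocks),                           # free blocks in total
        "N_FSB_fit": len(fitting),                      # blocks large enough
        "I_start": i_start,                             # first-fit start index
        "S_first": s_first,                             # first-fit block size
        "S_FSB_avg": total_free / len(blocks) if blocks else 0.0,
    }

if __name__ == "__main__":
    mask = [True, True, False, True, True, True, False, False, True, True]
    print(spectrum_features(mask, n_required=3))
    # the first block of size >= 3 starts at index 3
```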

2) Low-Level Action a^i: In Domain i, the low-level action a^i is to select one path from the K^i candidate paths according to the source node and the destination node of the current domain. Next, the modulation format is determined by the path distance according to Table I. Then, with the modulation format chosen, the number of FSs needed can be calculated by (1). Finally, whether the selected path can support the service request with enough available FSs should be checked based on the spectrum continuity and contiguity constraints. For the intra-domain service request, if there are enough available FSs, the service request is accepted by allocating the spectrum with the first-fit rule. Otherwise, it will be blocked. For the inter-domain service request, if there are enough available FSs in all related domains and inter-domain links, the service request is accepted by allocating the spectrum with the first-fit rule. Otherwise, it will be blocked.
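A minimal first-fit allocation along an intra-domain path, enforcing continuity (the same slot indices free on every link) and contiguity (adjacent slots), can be sketched as follows; the spectrum representation and names are illustrative assumptions.

```python
# Sketch of first-fit spectrum allocation along an intra-domain path.
# `spectrum` maps a link id to a list of booleans (True = free slot).

def first_fit_allocate(spectrum, path_links, n_required):
    """Find the lowest-indexed block of n_required slots that is free on all
    links of the path, mark it occupied, and return its start index (or None)."""
    n_slots = len(spectrum[path_links[0]])
    combined = [all(spectrum[link][s] for link in path_links)   # continuity
                for s in range(n_slots)]
    for start in range(n_slots - n_required + 1):
        if all(combined[start:start + n_required]):             # contiguity
            for link in path_links:
                for s in range(start, start + n_required):
                    spectrum[link][s] = False                   # occupy slots
            return start
    return None                                                 # request blocked

if __name__ == "__main__":
    spectrum = {"e1": [True] * 8, "e2": [True, False] + [True] * 6}
    print(first_fit_allocate(spectrum, ["e1", "e2"], n_required=3))  # 2
```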
TABLE IV: LOW-LEVEL REWARD FOR INTER-DOMAIN SERVICE REQUESTS

3) Low-Level Reward r^i: In Domain i, the low-level reward r^i is a scalar returned to the low-level agent i, which evaluates the instantaneous benefit of the action a^i taken. For actions handling the inter-domain service requests, the detailed reward setting is given in Table IV. If the incoming service request is successfully accepted, the reward is set to be a positive value. If the incoming service request is blocked, we consider the effect of the high-level action in three cases. (1) The feasible probability of the virtual path chosen by the high-level agent is the largest among all virtual paths, i.e., P_{a^H} = max P_k: the reward is set to be a large negative value. (2) The feasible probability of the chosen virtual path is the smallest among all virtual paths, i.e., P_{a^H} = min P_k: the reward is set to be a small negative value. (3) The feasible probability of the chosen virtual path is moderate among all virtual paths, i.e., P_{a^H} ≠ max P_k and P_{a^H} ≠ min P_k: the reward is set to be a moderate negative value.

The reasons for the above reward setting in the case of blocking are as follows. The feasible probability of the virtual path reflects the congestion level of the virtual path. A larger feasible probability means that the connectivity between the ingress node and the egress node inside the relevant domains is better, and therefore it is easier for the low-level agent to choose an action that accepts the service. In this situation, if the service request is finally blocked, we give a large penalty to the low-level agent. On the contrary, a smaller feasible probability means that the connectivity between the ingress node and the egress node inside the relevant domains is worse, and therefore it is more difficult for the low-level agent to choose an action that accepts the service. In this situation, if the service request is finally blocked, we give a small penalty to the low-level agent.

Intra-domain service requests are only dealt with by the corresponding low-level agent. The state representation and actions are the same as those for the inter-domain service request, as introduced in Section IV-C1 and Section IV-C2. The reward setting is different since it does not need to consider the actions of the high-level agent. Therefore, the reward is simply set to be 1 if the request is accepted and −1 if it is blocked.

D. The Training Process of the HRL

The A3C algorithm is used for the training of the high-level agent and the low-level agents. The high-level agent and the low-level agents collaborate during the training. For the sake of readability, we first introduce the background of the A3C algorithm, which can be used for the training of one agent in general without collaboration. Then, we explain the detailed collaborations of the agents.

1) Background (The A3C Algorithm): Each agent includes a number of local learners, interacting with separate environment copies. In addition, each agent has a global learner, which updates its parameters according to the results learned by the local learners. Each learner consists of an actor, which outputs the action according to the input state, and a critic, which measures the goodness of the input state. The actor and the critic are parameterized by two neural networks, called the policy network and the value network, respectively. For the global learner, the parameters of the actor and the critic are denoted as Θ and Θ_v. For one local learner, they are denoted as θ and θ_v. For the local learner, the outputs are π(a_t | s_t; θ) and V(s_t; θ_v), respectively.

The value function V(s_t) is defined to be the expected cumulative discounted reward starting from state s_t,

V(s_t) = E[G_t] = E[ ∑_{i=0}^{∞} γ^i r_{t+i} ],    (6)

where 0 < γ < 1 is the discount factor, and G_t is called the return. Then, the value function V(s_t) has the following relationship with its n-step value function V(s_{t+n}):

V(s_t) = E[ ∑_{i=0}^{n−1} γ^i r_{t+i} + γ^n V(s_{t+n}) ].    (7)

Therefore, ∑_{i=0}^{n−1} γ^i r_{t+i} + γ^n V(s_{t+n}) can be regarded as an estimation of the value function. As a result, during the training, it is used as the learning target,

Target V(s_t; θ_v) = ∑_{i=0}^{n−1} γ^i r_{t+i} + γ^n V(s_{t+n}; θ_v).    (8)

For the state-action pair (s_t, a_t), the advantage function is defined to measure how much better the actually chosen action is than average actions. The advantage function, therefore, can be defined to be the difference between the target value function (measuring the return of the actually chosen action given by the current policy π(a_t | s_t; θ)) and the estimated value function V(s_t) (measuring the average return of all possible actions). The advantage function is given by

A(s_t, a_t; θ, θ_v)|_n = ∑_{i=0}^{n−1} γ^i r_{t+i} + γ^n V(s_{t+n}; θ_v) − V(s_t; θ_v).    (9)

In A3C, each local learner learns asynchronously as follows. First, the local learner gets parameters from the global learner. Then it collects T consecutive samples (s_t, a_t, r_t), t = {t_0, t_0 + 1, ..., t_0 + T − 1}, and stores them in an episode buffer during the interaction with its environment. These samples are then used for the training. It computes the advantage function for each of the state-action pairs from the samples. Each advantage function uses the longest possible n-step return from this episode. Specifically, the first state uses its T-step return, the second state uses its (T − 1)-step return, and the last state uses its one-step return. After that, the gradients are accumulated. Notice that the value network is updated with the objective to minimize the mean squared error of the estimated value function V(s_t; θ_v) and the estimated target value function, i.e., the mean squared error of the advantage function. Therefore, the accumulated gradient is given by (10). The policy network is updated with the objective to maximize the average reward per time-step, with its gradient given by (11), according to [36]:

grad(θ_v) = ∑_{t=t_0}^{t_0+T−1} ∇_{θ_v} ( A(s_t, a_t; θ, θ_v)|_{T−(t−t_0)} )^2,    (10)

grad(θ) = ∑_{t=t_0}^{t_0+T−1} [ ∇_θ log π(a_t | s_t; θ) · A(s_t, a_t; θ, θ_v)|_{T−(t−t_0)} + β ∇_θ H(π(s_t; θ)) ].    (11)

H(π(s_t; θ)) is the entropy of the policy distribution [36], and β controls the strength of the entropy regularization term.
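The n-step targets and advantages of (8) and (9) can be computed over one T-sample buffer with a single backward pass, as sketched below; the default discount factor is a placeholder, since the Table VI parameter values are not reproduced in this text.

```python
# Sketch of the n-step learning targets and advantages of eqs. (8)-(9),
# computed over one episode buffer of T samples exactly as described: the
# first state uses its T-step return, the last its 1-step return. `values`
# holds V(s_t; theta_v) for t = t0 .. t0+T (one extra bootstrap value for
# the state after the last sample).

def n_step_targets_and_advantages(rewards, values, gamma=0.95):
    T = len(rewards)
    assert len(values) == T + 1, "need a bootstrap value for the final state"
    targets, advantages = [0.0] * T, [0.0] * T
    ret = values[T]                       # bootstrap with V(s_{t0+T})
    for t in reversed(range(T)):          # work backwards through the buffer
        ret = rewards[t] + gamma * ret    # eq. (8): n-step discounted return
        targets[t] = ret
        advantages[t] = ret - values[t]   # eq. (9): target minus baseline
    return targets, advantages

if __name__ == "__main__":
    rewards = [1.0, -1.0, 1.0]
    values = [0.2, 0.1, 0.3, 0.0]         # V(s_t0..t0+2) plus bootstrap value
    print(n_step_targets_and_advantages(rewards, values, gamma=0.9))
```

The squared advantages then drive the value-network gradient of (10), and the advantages weight the log-policy gradient of (11).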
These gradients are then transmitted to the global learner to update its parameters. Specifically, the parameters for the value

network and the policy network are updated by the mini-batch (batch size T) gradient descent and ascent given by (12) and (13), respectively:

Θ_v ← Θ_v − η · grad(θ_v),    (12)
Θ ← Θ + α · grad(θ).    (13)

η > 0 and α > 0 are the learning rates.

2) High-Level Agent and Low-Level Agents Collaboration: In our proposed hierarchical reinforcement learning framework, the A3C algorithm is applied to the high-level agent and the low-level agents in a collaborative way. As mentioned, each agent includes a number of local learners, interacting with separate environment copies. Notice that, for one specific environment copy, there is one local learner of the high-level agent and one local learner of each low-level agent learning collaboratively in the same environment copy. The learning procedures for one environment copy are given by Algorithm 1. The specific parameter notations are given in Table V.

TABLE V: NOTATIONS FOR THE PARAMETERS OF THE HIGH-LEVEL AGENT AND THE LOW-LEVEL AGENT i

Algorithm 1: Learning Procedures of One MDEON Environment Copy.

The MDEON environment copy and the relevant local learners are first initialized (lines 1−5). There are two types of service requests: one is the inter-domain service request and the other is the intra-domain service request. When an inter-domain service request is coming, the high-level agent gives an action representing a virtual path (lines 8−10). With the given virtual path, each related low-level agent gives an action representing an intra-domain RMSA decision (lines 11−14). Then, these actions are applied to the MDEON environment (line 15). After that, the high-level reward is set and the sample is stored in its buffer D^H. If the buffer size reaches a pre-defined value T, the training is conducted (explained later). After training, the buffer is emptied and the parameters of this local learner are refreshed by the global learner (lines 16−21). For each related low-level agent, the sampling and training procedures are similar (lines 22−28). When an intra-domain service request is coming, the related low-level agent gives an action representing an intra-domain RMSA decision and applies it to the EON environment. After that, the sampling and training procedures are similar to the cases when handling inter-domain service requests (lines 29−38). The training procedures are given in lines 39−44. Firstly, the algorithm computes the advantage function for each of the state-action pairs from the samples. Then, the gradients of the critic and the actor are accumulated by (10) and (11). Finally, the parameters of the global learner are updated according to (12) and (13).
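The pseudocode box of Algorithm 1 is not reproduced in this text; the Python-style sketch below only reconstructs the control flow from the walkthrough above. The environment and agent interfaces (env.next_request, agent.act, agent.train_from, and so on) are illustrative stand-ins, not an API from the paper.

```python
# Sketch of the per-environment-copy learning loop described for Algorithm 1.
# All object interfaces are assumed placeholders for illustration only.

def run_environment_copy(env, high_agent, low_agents, T, num_requests):
    buffers = {"H": []}
    buffers.update({d: [] for d in low_agents})          # one buffer per agent

    def maybe_train(agent, key):
        if len(buffers[key]) >= T:                        # buffer full: train
            agent.train_from(buffers[key])                # gradients (10)-(13)
            buffers[key].clear()
            agent.sync_with_global()                      # refresh local params

    for _ in range(num_requests):
        req = env.next_request()
        if req.is_inter_domain:
            s_h = env.high_level_state(req)
            a_h = high_agent.act(s_h)                     # pick a virtual path
            low_steps = {}
            for d in env.domains_on(a_h):
                s_d = env.low_level_state(d, req, a_h)
                low_steps[d] = (s_d, low_agents[d].act(s_d))   # intra-domain RMSA
            accepted = env.apply(req, a_h,
                                 {d: a for d, (_, a) in low_steps.items()})
            buffers["H"].append((s_h, a_h, env.high_reward(accepted, a_h)))
            maybe_train(high_agent, "H")
            for d, (s_d, a_d) in low_steps.items():
                buffers[d].append((s_d, a_d, env.low_reward(d, accepted, a_h)))
                maybe_train(low_agents[d], d)
        else:                                             # intra-domain request
            d = req.domain
            s_d = env.low_level_state(d, req, None)
            a_d = low_agents[d].act(s_d)
            accepted = env.apply(req, None, {d: a_d})
            buffers[d].append((s_d, a_d, 1.0 if accepted else -1.0))
            maybe_train(low_agents[d], d)
```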


V. SIMULATION RESULTS AND ANALYSIS

In this section, we conduct a number of simulations to evaluate the performance of our proposed approach. We compare the performance and generalization capability of our method with respect to state-of-the-art solutions in various scenarios. For the sake of simplicity, we name the proposed approach "DRL+DRL" (i.e., DRL in both the high-level and low-level modules). The compared approaches include the following:
• KSP+KSP: The classical heuristic KSP is used for both the high-level module and the low-level modules.
• PCE+KSP: The PCE proposed in [25] is used for the high-level module and KSP is used for the low-level modules.
• DRL+KSP: DRL is used for the high-level module and KSP is used for the low-level modules.
• KSP+DRL: KSP is used for the high-level module and DRL is used for the low-level modules.
• PCE+DRL: PCE is used for the high-level module and DRL is used for the low-level modules.

A. Simulation Setup

If not specified otherwise, the common parameters used in all the simulation scenarios are listed in Table VI for convenience. The values are chosen according to the literature [16], [17] and our experience. The service requests are generated according to independent Poisson processes and the service time is exponentially distributed. The traffic patterns for both the intra-domain service requests and the inter-domain service requests follow uniform distributions, while the arrival rates are different. The bandwidth demand of each service request is generated from a uniform distribution for both the inter-domain and intra-domain service requests.

TABLE VI: COMMON PARAMETERS USED IN ALL SIMULATIONS

In terms of agent design, all agents have the same network architecture. Each agent consists of a policy network and a value network, which represent the actor and the critic in Section IV-D1, with parameters given in Table VI. Besides, we use a softmax output for the policy network and a linear output for the value network, with all non-output layers activated by ReLU [36].

Notice that, at the beginning of the simulation, the entire elastic optical network is empty. In order to exclude the initial transient phase, we have a warm-up period of 3,000 service requests, and then calculate the blocking probability for every 1,000 requests.
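The dynamic traffic model described above can be sketched as below; the bandwidth choices and numeric defaults are placeholders rather than the Table VI values, which are not reproduced in this text.

```python
import random

# Sketch of the traffic model of Section V-A: Poisson arrivals (exponential
# inter-arrival times), exponentially distributed holding times, and uniform
# node pairs and bandwidth demands. All numeric defaults are placeholders.

def generate_requests(num_requests, arrival_rate, mean_holding_time,
                      nodes, bw_choices=(25, 50, 75, 100), seed=0):
    rng = random.Random(seed)
    t, requests = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(arrival_rate)              # Poisson arrival process
        src, dst = rng.sample(nodes, 2)                 # uniform node pair
        requests.append({
            "arrival": t,
            "holding": rng.expovariate(1.0 / mean_holding_time),
            "src": src, "dst": dst,
            "bandwidth_gbps": rng.choice(bw_choices),   # uniform bandwidth demand
        })
    return requests

if __name__ == "__main__":
    print(generate_requests(3, arrival_rate=16, mean_holding_time=10,
                            nodes=list(range(14)))[0])
```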
B. Performance Evaluations With Comparison Methods

To investigate the effectiveness of our approach, we have conducted simulations in large-scale multi-domain elastic optical networks, which consist of three NSFNet domains as shown in Fig. 5(a), and the results are shown in Fig. 5(b). The curve for KSP+KSP in Fig. 5(b) represents the average blocking probability of one million arrivals, after a warm-up of 3,000 arrivals. As the traffic load is unchanged in this scenario, for the KSP+KSP, the blocking probability is stable. The curves for the DRL-based methods in Fig. 5(b) show blocking probabilities smoothed over a 1000-length window during the training. From Fig. 5(b), we can observe that, after a short training time (e.g., 5 × 10^6 arrivals), all DRL-based methods, except DRL+KSP, outperform the heuristic KSP+KSP and PCE+KSP methods. This reveals the effectiveness of the DRL-based approaches. Furthermore, our proposed approach gives better performance than the other five methods in reducing the final blocking probabilities. Here, we define the blocking probability reduction of our method over a compared method as

[BP(compared method) − BP(our method)] / BP(compared method),

where BP(·) denotes the final blocking probability of one method. Then, the blocking probability reductions of our approach over KSP+KSP, PCE+KSP, DRL+KSP, KSP+DRL and PCE+DRL are 90.67%, 95.14%, 91.57%, 41.67% and 78.79%, respectively. This demonstrates the effectiveness of our proposed approach, which introduces DRL into both the high-level control plane and the low-level control plane.

C. Effect of the Arrival Rates

To demonstrate the performance of our approaches with different traffic loads, we performed the simulations on the network shown in Fig. 5(a). In this subsection, we conducted the simulations over different arrival rates (i.e., 16, 19, 22, 25, 28 arrivals per time unit). Figure 6 shows the average blocking probabilities (averaged over 3 × 10^7 ∼ 3.1 × 10^7 arrivals) for cases with different arrival rates. We find that our proposed approach has the best performance. In this simulation, the average blocking probability reduction for arrival rates 16 to 28 is 94.54% compared to KSP+KSP, 96.60% compared to PCE+KSP, 93.26% compared to DRL+KSP, 64.63% compared to KSP+DRL and 85.72% compared to PCE+DRL, respectively. These simulation results demonstrate that the proposed method possesses good universality under different traffic loads.


Fig. 5. Performance evaluations of the proposed method with comparison methods.

Fig. 6. Blocking probabilities over different arrival rates for different methods.

D. Effect of Different Network Topologies

To demonstrate the performance of our approaches with different network topologies, we have conducted three sets of simulations:
1) 3-domain NSFNet (Fig. 5(a)), results shown in Fig. 5(b);
2) 4-domain EON (Fig. 7(c)), results shown in Fig. 7(d);
3) 6-domain EON (Fig. 7(a)), results shown in Fig. 7(b).
For Simulations 2) and 3), we have similar observations as in Simulation 1), that our proposed method DRL+DRL is the best. Besides, we also find that the performance improvement of DRL+DRL over the KSP+DRL and PCE+DRL is large for Simulation 1), while the improvement is smaller for Simulation 2) and even smaller for Simulation 3). The possible reasons are as follows. For the 4-domain EON topology in Simulation 2) and the 6-domain EON topology in Simulation 3), the domain environment is less complicated than the 3-domain NSFNet topology. If we compare the following three methods: KSP+DRL, PCE+DRL, and DRL+DRL, the main differences are the routing for the high-level module, i.e., the selection of the virtual path. Specifically, KSP+DRL regards the domains as virtual nodes and ignores the intra-domain information; PCE+DRL considers some intra-domain information by including the spectrum availability around the egress/ingress nodes; DRL+DRL considers more intra-domain information by the feasible probability (defined in Section IV-B1), which reflects the connectivity between the ingress node and the egress node. As a result, if the network topology of each domain is more complicated, our proposed method can extract more information, which is beneficial for the RMSA decision. Therefore, our method has a greater advantage when the domain topology is more complicated. Comparing the topologies shown in Fig. 7(c), Fig. 7(a), and Fig. 5(a), the domain-level topology becomes more and more complex, and therefore, the advantage of our proposal becomes more and more apparent.

E. Effect of Varying Inter-Domain Links

To investigate the performance in cases with different numbers of inter-domain links, we conducted simulations for the following two cases:
• Link-322: The numbers of inter-domain links are 3, 2 and 2, as shown by the topology in Fig. 5(a).
• Link-333: The numbers of inter-domain links are 3, 3 and 3, as shown by the topology in Fig. 8(a).
Figure 8(b) shows the average blocking probabilities of the inter-domain requests for different approaches. We find that our proposed approach DRL+DRL has better performance than the compared KSP+DRL for both cases. Besides, with more inter-domain links, the benefit of adopting the proposed approach becomes greater.


Fig. 7. Effect of different network topologies.

Fig. 8. Effect of varying inter-domain links.


the congestion level of low-level modules through the feasible


probability. Meanwhile, as given by Table IV, the reward for
each low-level module also reflects the goodness of actions from
the high-level module. As given by Table II and Table IV, the
rewards are set to be some discrete values, so we name this case
as “DRL+DRL with discrete reward” in this subsection.
To validate the effectiveness of this cooperation through re-
ward setting. We have conducted simulations with the following
three kinds of reward settings.
r DRL+DRL with simple reward: the reward is always set
to be 1 if the request can be successfully accepted and
−1 otherwise, for both high-level module and low-level
modules. In this case, no cooperation is reflected in the
reward.
r DRL+DRL with discrete reward: the reward is set as given Fig. 9. Blocking probabilities of inter-domain service requests for the network
in Fig. 5(a).
by Table II and Table IV. In this case, some level of
cooperation is realized by the reward design as mentioned.
r DRL+DRL with continuous reward: The reward for the more nuanced the cooperation, the better the performance can
high-level module is set as given by Table II except for be.
the following case. When the request is accepted and
the feasible probability of the chosen virtual path PaH is G. Discussion
neither equal to max Pk nor equal to min Pk , the reward rH
As observed in the above simulations, during the training
is set to be the value satisfying the following proportional
period, the performance of DRL-based approaches may not
relationship,
be acceptable. To satisfy the users’ Quality-of-Service require-
max Pk − PaH max Pk − min Pk ments, the training process should not be applied directly to
= , (14)
3 − rH 3 − 0.8 the practical network. The digital twin [37], [38], [39] of the
elastic optical network can be used for the training of DRL-
and we have,
based approaches. Only when the performance of the trained
0.8 max Pk + 2.2PaH − 3 min Pk model becomes acceptable on the digital twin, the model can be
rH = . (15)
max Pk − min Pk deployed to the practical optical networks.
The reward for the low-level module is set as given by Table IV except, similarly, for the following case. When the request is blocked and the feasible probability of the chosen virtual path P_a^H is neither equal to max P_k nor equal to min P_k, the reward r_i is set to the value satisfying the following proportional relationship,

\frac{\max P_k - P_a^H}{3 - (-r_i)} = \frac{\max P_k - \min P_k}{3 - 0.8},    (16)

and we have

r_i = \frac{-0.8 \max P_k - 2.2 P_a^H + 3 \min P_k}{\max P_k - \min P_k}.    (17)
For “DRL+DRL with continuous reward,” cooperation is realized by the reward design and the exchanged information is more nuanced.
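For illustration, the following Python snippet is a minimal sketch (not the authors' implementation) of how the continuous rewards in (15) and (17) could be computed from the feasible probabilities of the candidate virtual paths. The function name, its signature, and the fallback arguments r_H_table and r_i_table are hypothetical; they stand in for the discrete rewards of Tables II and IV, whose values are not reproduced here.

def continuous_rewards(P_k, a_H, accepted, r_H_table, r_i_table):
    """Sketch of the 'DRL+DRL with continuous reward' rule in (14)-(17)."""
    P_aH = P_k[a_H]                       # feasible probability of the chosen virtual path
    P_max, P_min = max(P_k), min(P_k)
    r_H, r_i = r_H_table, r_i_table       # default to the discrete Table II / Table IV rewards

    # The proportional rule applies only when the chosen path is neither the
    # best nor the worst candidate in terms of feasible probability.
    if P_min < P_aH < P_max:
        if accepted:
            # Eq. (15): r_H grows linearly from 0.8 to 3 as P_aH moves from
            # min P_k towards max P_k, rewarding the high-level module for
            # choosing a less congested virtual path.
            r_H = (0.8 * P_max + 2.2 * P_aH - 3 * P_min) / (P_max - P_min)
        else:
            # Eq. (17): the low-level penalty |r_i| grows from 0.8 to 3 as
            # P_aH approaches max P_k, since a blocked request is attributed
            # more to the low-level modules when the high-level module had
            # already picked a promising virtual path.
            r_i = (-0.8 * P_max - 2.2 * P_aH + 3 * P_min) / (P_max - P_min)
    return r_H, r_i

For example, with candidate feasible probabilities [0.2, 0.5, 0.9] and the middle path chosen, an accepted request yields r_H = (0.8·0.9 + 2.2·0.5 − 3·0.2)/0.7 ≈ 1.74, between the 0.8 and 3 extremes.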
Fig. 9. Blocking probabilities of inter-domain service requests for the network in Fig. 5(a).

We compare the blocking probabilities of the inter-domain service requests, as shown in Fig. 9. We observe that “DRL+DRL with discrete reward” and “DRL+DRL with continuous reward” give better performance than “DRL+DRL with simple reward” in reducing the final blocking probability. Specifically, the final blocking probability reductions over “DRL+DRL with simple reward” are 59.18% and 93.60%, respectively. These simulation results demonstrate the effectiveness of the cooperation. We also find that “DRL+DRL with continuous reward” gives better performance than “DRL+DRL with discrete reward”. This indicates that the more nuanced the cooperation, the better the performance can be.

G. Discussion

As observed in the above simulations, the performance of DRL-based approaches may not be acceptable during the training period. To satisfy the users' Quality-of-Service requirements, the training process should not be applied directly to the practical network. Instead, the digital twin [37], [38], [39] of the elastic optical network can be used for the training of DRL-based approaches. Only when the performance of the trained model becomes acceptable on the digital twin can the model be deployed to the practical optical networks.
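A possible gating step for this workflow is sketched below; it is purely illustrative, and the class, policy, and threshold names (TwinEON, trained_hrl_policy, QOS_BLOCKING_TARGET) are hypothetical stand-ins rather than components described in this paper.

import random

class TwinEON:
    """Toy stand-in for a digital twin of the MDEON (illustrative only)."""
    def blocking_probability(self, policy, n_requests=10_000):
        # Replay synthetic inter-domain requests through the twin and measure
        # the fraction that the candidate policy fails to provision.
        blocked = sum(1 for _ in range(n_requests) if not policy(random.random()))
        return blocked / n_requests

def trained_hrl_policy(request):
    # Placeholder for the trained HRL agent's RMSA decision: returns True if
    # the request would be provisioned on the twin.
    return request < 0.99

QOS_BLOCKING_TARGET = 0.02  # assumed acceptable blocking probability
twin = TwinEON()
if twin.blocking_probability(trained_hrl_policy) <= QOS_BLOCKING_TARGET:
    print("Acceptable on the digital twin: deploy the model to the live network.")
else:
    print("Not yet acceptable: keep training on the twin, not on live traffic.")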
VI. CONCLUSION

In this paper, we propose an HRL-based framework to realize the joint RMSA in the MDEON. The HRL framework includes a high-level module and multiple low-level modules, collaboratively dealing with the RMSA of the inter-domain service requests. The collaboration is reflected in the following three aspects. (1) The high-level DRL module receives some abstracted intra-domain environment information from the low-level DRL modules. (2) The high-level action determines the source and destination nodes of the low-level actions. (3) The reward design of the high-level DRL module considers the effect of the low-level modules, and vice versa. Simulation results demonstrate the effectiveness of our proposed approach.

REFERENCES

[1] "Cisco visual networking index: Forecast and trends, 2017–2022," 2019. [Online]. Available: https://www.cisco.com/c/en_in/index.html
[2] O. Gerstel, M. Jinno, A. Lord, and S. B. Yoo, "Elastic optical networking: A new dawn for the optical layer," IEEE Commun. Mag., vol. 50, no. 2, pp. s12–s20, Feb. 2012.
[3] M. Jinno, H. Takara, B. Kozicki, Y. Tsukishima, Y. Sone, and S. Matsuoka, "Spectrum-efficient and scalable elastic optical path network: Architecture, benefits, and enabling technologies," IEEE Commun. Mag., vol. 47, no. 11, pp. 66–73, Nov. 2009.
[4] G. Zhang, M. De Leenheer, A. Morea, and B. Mukherjee, "A survey on OFDM-based elastic core optical networking," IEEE Commun. Surveys Tuts., vol. 15, no. 1, pp. 65–87, Jan.–Mar. 2013.


[5] L. Velasco, L. Gifre, and A. Castro, "Brokered orchestration for end-to-end service provisioning across heterogeneous multi-operator networks," in Proc. 19th Int. Conf. Transparent Opt. Netw., 2017, pp. 1–4.
[6] G. Liu et al., "Hierarchical learning for cognitive end-to-end service provisioning in multi-domain autonomous optical networks," J. Lightw. Technol., vol. 37, no. 1, pp. 218–225, Jan. 2019.
[7] Z. Zhu et al., "OpenFlow-assisted online defragmentation in single-/multi-domain software-defined elastic optical networks," J. Opt. Commun. Netw., vol. 7, no. 1, pp. A7–A15, 2015.
[8] L. Sun, X. Chen, and Z. Zhu, "Multibroker-based service provisioning in multidomain SD-EONs: Why and how should the brokers cooperate with each other," J. Lightw. Technol., vol. 35, no. 17, pp. 3722–3733, Sep. 2017.
[9] Y. Zong, C. Feng, Y. Guan, Y. Liu, and L. Guo, "Virtual network embedding for multi-domain heterogeneous converged optical networks: Issues and challenges," Sensors, vol. 20, no. 9, 2020, Art. no. 2655.
[10] X. Chen, R. Proietti, H. Lu, A. Castro, and S. J. B. Yoo, "Knowledge-based autonomous service provisioning in multi-domain elastic optical networks," IEEE Commun. Mag., vol. 56, no. 8, pp. 152–158, Aug. 2018.
[11] K. Christodoulopoulos, I. Tomkos, and E. A. Varvarigos, "Elastic bandwidth allocation in flexible OFDM-based optical networks," J. Lightw. Technol., vol. 29, no. 9, pp. 1354–1366, May 2011.
[12] H. A. Dinarte, B. V. Correia, D. A. Chaves, and R. C. Almeida, "Routing and spectrum assignment: A metaheuristic for hybrid ordering selection in elastic optical networks," Comput. Netw., vol. 197, 2021, Art. no. 108287.
[13] B. C. Chatterjee, N. Sarma, and E. Oki, "Routing and spectrum allocation in elastic optical networks: A tutorial," IEEE Commun. Surveys Tuts., vol. 17, no. 3, pp. 1776–1800, Jul.–Sep. 2015.
[14] A. Bocoi, M. Schuster, F. Rambach, M. Kiese, C.-A. Bunge, and B. Spinnler, "Reach-dependent capacity in optical networks enabled by OFDM," in Proc. Conf. Opt. Fiber Commun., 2009, pp. 1–3.
[15] E. E. Moghaddam, H. Beyranvand, and J. A. Salehi, "Routing, spectrum and modulation level assignment, and scheduling in survivable elastic optical networks supporting multi-class traffic," J. Lightw. Technol., vol. 36, no. 23, pp. 5451–5461, Dec. 2018.
[16] X. Chen, B. Li, R. Proietti, H. Lu, Z. Zhu, and S. J. B. Yoo, "DeepRMSA: A deep reinforcement learning framework for routing, modulation and spectrum assignment in elastic optical networks," J. Lightw. Technol., vol. 37, no. 16, pp. 4155–4163, Aug. 2019.
[17] Y.-C. Huang, J. Zhang, and S. Yu, "Self-learning routing for optical networks," in Proc. Int. IFIP Conf. Opt. Netw. Des. Model., 2019, pp. 467–478.
[18] B. Tang, J. Chen, Y.-C. Huang, Y. Xue, and W. Zhou, "Optical network routing by deep reinforcement learning and knowledge distillation," in Proc. Asia Commun. Photon. Conf., 2021, pp. 1–3.
[19] L. Xu, Y.-C. Huang, Y. Xue, and X. Hu, "Deep reinforcement learning-based routing and spectrum assignment of EONs by exploiting GCN and RNN for feature extraction," J. Lightw. Technol., vol. 40, no. 15, pp. 4945–4955, Aug. 2022.
[20] B. Tang, Y.-C. Huang, Y. Xue, and W. Zhou, "Deep reinforcement learning-based RMSA policy distillation for elastic optical networks," Mathematics, vol. 10, no. 18, 2022, Art. no. 3293.
[21] N. E. D. E. Sheikh, E. Paz, J. Pinto, and A. Beghelli, "Multi-band provisioning in dynamic elastic optical networks: A comparative study of a heuristic and a deep reinforcement learning approach," in Proc. Int. Conf. Opt. Netw. Des. Model., 2021, pp. 1–3.
[22] B. Tang, Y.-C. Huang, Y. Xue, and W. Zhou, "Heuristic reward design for deep reinforcement learning-based routing, modulation and spectrum assignment of elastic optical networks," IEEE Commun. Lett., vol. 26, no. 11, pp. 2675–2679, Nov. 2022.
[23] Y. Hirota, H. Tode, and K. Murakami, "Multi-fiber based dynamic spectrum resource allocation for multi-domain elastic optical networks," in Proc. 18th Optoelectron. Commun. Conf. Held Jointly Int. Conf. Photon. Switching, 2013, pp. 1–2.
[24] S. Li, B. Li, and Z. Zhu, "To cooperate or not to cooperate: Service function chaining in multi-domain edge-cloud elastic optical networks," in Proc. Conf. Opt. Fiber Commun., 2022, pp. 1–3.
[25] B. Li, R. Zhang, X. Tian, and Z. Zhu, "Multi-agent and cooperative deep reinforcement learning for scalable network automation in multi-domain SD-EONs," IEEE Trans. Netw. Serv. Manag., vol. 18, no. 4, pp. 4801–4813, Dec. 2021.
[26] Z. Zhong, N. Hua, Z. Yuan, Y. Li, and X. Zheng, "Routing without routing algorithms: An AI-based routing paradigm for multi-domain optical networks," in Proc. Conf. Opt. Fiber Commun., 2019, pp. 1–3.
[27] R. Casellas et al., "A control plane architecture for multi-domain elastic optical networks: The view of the IDEALIST project," IEEE Commun. Mag., vol. 54, no. 8, pp. 136–143, Aug. 2016.
[28] C. Chen et al., "Demonstration of OpenFlow-controlled cooperative resource allocation in a multi-domain SD-EON testbed across multiple nations," in Proc. Eur. Conf. Opt. Commun., 2014, pp. 1–3.
[29] X. Chen, C. Chen, D. Hu, S. Ma, and Z. Zhu, "Multi-domain fragmentation-aware RSA operations through cooperative hierarchical controllers in SD-EONs," in Proc. Asia Commun. Photon. Conf., 2015, pp. 1–3.
[30] J. Zhu, X. Chen, D. Chen, S. Zhu, and Z. Zhu, "Service provisioning with energy-aware regenerator allocation in multi-domain EONs," in Proc. IEEE Glob. Commun. Conf., 2015, pp. 1–6.
[31] H.-C. Le, D. T. Hai, and N. T. Dang, "Development of dynamic QoT-aware lightpath provisioning scheme with flexible advanced reservation for distributed multi-domain elastic optical networks," in Proc. 10th Int. Symp. Inf. Commun. Technol., 2019, pp. 210–215.
[32] D. Batham and D. S. Yadav, "HPDST: Holding pathlength domain scheduled traffic strategy for multi-domain optical networks," Optik, vol. 222, 2020, Art. no. 165145.
[33] X. Chen, B. Li, R. Proietti, C.-Y. Liu, Z. Zhu, and S. J. Ben Yoo, "Demonstration of distributed collaborative learning with end-to-end QoT estimation in multi-domain elastic optical networks," Opt. Exp., vol. 27, no. 24, pp. 35700–35709, 2019.
[34] D. Li, H. Fang, X. Zhang, J. Qi, and Z. Zhu, "DeepMDR: A deep-learning-assisted control plane system for scalable, protocol-independent, and multi-domain network automation," IEEE Commun. Mag., vol. 59, no. 3, pp. 62–68, Mar. 2021.
[35] W. Lu and Z. Zhu, "Dynamic service provisioning of advance reservation requests in elastic optical networks," J. Lightw. Technol., vol. 31, no. 10, pp. 1621–1627, May 2013.
[36] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[37] J. Li, D. Wang, M. Zhang, and S. Cui, "Digital twin-enabled self-evolved optical transceiver using deep reinforcement learning," Opt. Lett., vol. 45, no. 16, pp. 4654–4657, 2020.
[38] S. Cui, D. Wang, J. Li, and M. Zhang, "Dynamic programmable optical transceiver configuration based on digital twin," IEEE Commun. Lett., vol. 25, no. 1, pp. 205–208, Jan. 2021.
[39] D. Wang et al., "The role of digital twin in optical communication: Fault management, hardware configuration, and transmission simulation," IEEE Commun. Mag., vol. 59, no. 1, pp. 133–139, 2021.
