

DeepRMSA: A Deep Reinforcement Learning Framework for Routing, Modulation and Spectrum Assignment in Elastic Optical Networks
Xiaoliang Chen, Member, IEEE, Baojia Li, Roberto Proietti, Hongbo Lu, Zuqing Zhu, Senior
Member, IEEE, S. J. Ben Yoo, Fellow, IEEE, Fellow, OSA

X. Chen, R. Proietti, H. Lu and S. J. B. Yoo are with the Department of Electrical and Computer Engineering, University of California, Davis, Davis, CA 95616, USA (Email: [email protected], [email protected]). B. Li and Z. Zhu are with the School of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, P. R. China (Email: [email protected]). Manuscript received Dec. 8, 2018.

Abstract—This paper proposes DeepRMSA, a deep reinforcement learning framework for routing, modulation and spectrum assignment (RMSA) in elastic optical networks (EONs). DeepRMSA learns the correct online RMSA policies by parameterizing the policies with deep neural networks (DNNs) that can sense complex EON states. The DNNs are trained with experiences of dynamic lightpath provisioning. We first modify the asynchronous advantage actor-critic algorithm and present an episode-based training mechanism for DeepRMSA, namely, DeepRMSA-EP. DeepRMSA-EP divides the dynamic provisioning process into multiple episodes (each containing the servicing of a fixed number of lightpath requests) and performs training by the end of each episode. The optimization target of DeepRMSA-EP at each step of servicing a request is to maximize the cumulative reward within the rest of the episode. Thus, we obviate the need for estimating the rewards related to unknown future states. To overcome the instability issue in the training of DeepRMSA-EP due to the oscillations of cumulative rewards, we further propose a window-based flexible training mechanism, i.e., DeepRMSA-FLX. DeepRMSA-FLX attempts to smooth out the oscillations by defining the optimization scope at each step as a sliding window, and by ensuring that the cumulative rewards always include rewards from a fixed number of requests. Evaluations with two sample topologies show that DeepRMSA-FLX can effectively stabilize the training while achieving blocking probability reductions of more than 20.3% and 14.3% when compared with the baselines.

Index Terms—Elastic optical networks (EONs), Routing, modulation and spectrum assignment (RMSA), Deep reinforcement learning, Asynchronous advantage actor-critic algorithm.

I. INTRODUCTION

THE explosive growth of emerging applications (e.g., cloud computing) and the popular adoption of new networking paradigms (e.g., the Internet of Things) are demanding a new network infrastructure that can support dynamic, high-capacity and quality-of-transmission (QoT)-guaranteed end-to-end services. Recently, elastic optical networking (EON) has emerged as one of the most promising networking technologies for the next-generation backbone networks [1]. Compared with the traditional fixed-grid (e.g., 50 GHz) wavelength-division multiplexing (WDM) scheme, EON can flexibly set up bandwidth-variable superchannels by grooming a series of finer-granularity (e.g., 6.25 GHz) subcarriers and adapting the modulation formats according to the QoT of lightpaths [2].

The flexible resource allocation mechanisms in EON, on the other hand, make the corresponding service provisioning designs more complicated. To fully exploit the benefits of such flexibilities and realize cost-effective EONs, previous studies have intensively investigated the routing, modulation and spectrum assignment (RMSA) problem for EON [3]. The authors of [4]–[6] first proposed integer linear programming (ILP) models for solving static RMSA problems, where all the lightpath requests are assumed to be known a priori. While the ILP models can provide the optimal solutions to the RMSA problems, these problems are proved to be NP-hard [4] and are intractable at large scales. In this context, a number of heuristic or approximation algorithms have been developed. In [4], Wang et al. proposed two algorithms, namely, balanced load spectrum allocation and shortest path with maximum spectrum reuse, to minimize the maximum required spectrum resources in an EON accounting for the given traffic demand. The authors of [5] presented a simulated annealing approach for determining the servicing order of lightpath requests and applied the k-shortest path routing and first-fit (KSP-FF) scheme to calculate the RMSA solution for each request afterward. In [6], [7], the authors investigated leveraging genetic algorithms to realize joint RMSA optimizations. A conflict graph based two-phase algorithm with a proved performance level was proposed in [8]. For more heuristic RMSA designs, such as random-fit, exact-fit and most-used spectrum assignment, readers can refer to [3].

Unlike static RMSA problems, for which explicit optimization models can be formulated, optimizing dynamic lightpath provisioning in EONs (i.e., dynamic RMSA problems) is more challenging. The dynamic arrivals and departures of lightpath requests, as well as the uncertainty of future traffic, can dramatically destabilize the EON state and thus deteriorate the efficiency of optimizations based on the current state. To cope with such dynamics, a few dynamic RMSA designs have been reported lately, in addition to those that can be derived from the aforementioned static RMSA algorithms. The authors in [9] applied the multi-path routing scheme and developed several empirical weighting methods taking into account path lengths, link spectrum utilization, and other features to realize state-aware dynamic RMSA.
In [10], Yin et al. investigated the spectrum fragmentation effect in dynamic lightpath provisioning and proposed a fragmentation-aware RMSA algorithm to mitigate spectrum fragmentation. More aggressive service reconfiguration approaches, e.g., spectrum defragmentation [11], [12], have also been proposed as complements to normal RMSA algorithms to enable periodical service consolidations, but at the expense of high operational costs. However, the existing works only apply fixed RMSA policies regardless of the time-varying EON states, or rely on simple empirical policies based on manually extracted features; i.e., they lack a comprehensive perception of the holistic EON state and are therefore unable to achieve truly adaptive service provisioning in EONs.

In the meantime, recent advances in deep reinforcement learning (DRL) have demonstrated beyond human-level performance in handling large-scale online control tasks [13], [14]. By parameterizing policies with deep neural networks (DNNs) [15], DRL enables learning agents to perceive complex system states from high-dimensional input data (e.g., screenshots and traffic matrices) and progressively learn correct policies through experiences of repeated interactions with the target systems. The application of DRL in the communication and networking domain has received intensive research interest during the past two years [16]–[18]. In [17], the authors enhanced the general deep Q-learning framework in [13] with novel exploration and experience replay techniques to solve the traffic engineering problem. The authors of [18] presented a DRL-based framework for datacenter network management and demonstrated a DRL agent which can learn the optimal topology configurations with respect to different application profiles. Nevertheless, the application of DRL in optical networking, or in particular, for addressing the RMSA problem, has not been investigated.

In this paper, we propose DeepRMSA, a DRL-based RMSA framework for learning the optimal online RMSA policies in EONs. The contributions of this paper can be summarized as follows. 1) We propose, for the first time, a DRL framework for optical network management and resource allocation, i.e., RMSA. 2) We propose two training mechanisms for DeepRMSA, taking into account the unique characteristics of the RMSA problem. 3) Numerical results verify the superiority of DeepRMSA over the state-of-the-art heuristic algorithms.

The rest of the paper is organized as follows. Section II presents the RMSA problem formulation. Section III details the DeepRMSA framework. In Section IV, we elaborate on the design of DeepRMSA, including the modeling and the training mechanisms. Then, in Section V, we show the performance evaluations and related discussions. Finally, Section VI concludes the paper.

II. PROBLEM FORMULATION

Let G(V, E, F) denote an EON topology, where V and E represent the sets of nodes and fiber links, and F = {F_{e,f} | ∀e, f} contains the state of each frequency slot (FS) f ∈ [1, f_0] on each fiber link e ∈ E. We model a lightpath request from node o to d (o, d ∈ V) as R_t(o, d, b, τ), with b Gb/s and τ denoting the bandwidth requirement and the service duration, respectively. To provision R_t, we need to compute an end-to-end routing path P_{o,d}, determine a proper modulation format m to use for QoT assurance, and allocate a number of spectrally contiguous FS's (i.e., the spectrum contiguity constraint) on each link along P_{o,d} according to b and m. In this work, we assume that the EON is not equipped with spectrum conversion capability; therefore, the spectra allocated to R_t on different fibers must align (i.e., the spectrum continuity constraint). We adopt the impairment-aware model in [19] to decide the modulation format according to the physical distance of P_{o,d}. Specifically, the number of FS's needed can be computed as

n = \left\lceil \frac{b}{m \cdot C_{grid}^{BPSK}} \right\rceil,   (1)

where C_{grid}^{BPSK} is the data rate that an FS of BPSK signal can support and m ∈ {1, 2, 3, 4} corresponds to BPSK, QPSK, 8-QAM and 16-QAM, respectively.
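As an illustration of Eq. (1), the Python sketch below computes n for a request, under two assumptions that are ours rather than the paper's: a per-FS BPSK capacity C_{grid}^{BPSK} of 12.5 Gb/s and a simple distance-threshold mapping from path length to modulation level (the paper instead defers to the impairment-aware model of [19]).

import math

def modulation_level(path_length_km):
    # Illustrative distance thresholds (km) for 16-QAM/8-QAM/QPSK/BPSK;
    # the paper uses the impairment-aware model of [19] instead.
    if path_length_km <= 625:
        return 4      # 16-QAM
    if path_length_km <= 1250:
        return 3      # 8-QAM
    if path_length_km <= 2500:
        return 2      # QPSK
    return 1          # BPSK

def num_fs(bandwidth_gbps, path_length_km, c_bpsk_gbps=12.5):
    # Eq. (1): n = ceil(b / (m * C_grid^BPSK)); c_bpsk_gbps is an assumed value.
    m = modulation_level(path_length_km)
    return math.ceil(bandwidth_gbps / (m * c_bpsk_gbps))

# Example: a 100 Gb/s request over a 2000 km path uses QPSK (m = 2) and needs 4 FS's.
print(num_fs(100, 2000))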
The static RMSA problem (i.e., offline network planning) gives a set of permanent lightpath requests R = {R_t | ∀t} (τ → ∞) and requires provisioning all of them in a batch following the link capacity constraint [4]. The objective of the static RMSA problem is to minimize the total spectrum usage. Unlike the static problem, where the requests are known a priori, in the dynamic RMSA problem (i.e., online lightpath provisioning) considered in this work, lightpath requests arrive and expire on-the-fly and need to be serviced immediately upon their arrival. The dynamic RMSA problem aims at minimizing the long-term request blocking probability, which is defined as the ratio of the number of blocked requests to the total number of requests over a period.

III. DEEPRMSA FRAMEWORK

Fig. 1 shows the schematic of DeepRMSA. DeepRMSA takes advantage of the software-defined networking (SDN) paradigm for centralized and automated control and management of the EON data plane [20]. Specifically, a remote SDN controller interacts with the local SDN agents to collect network states and lightpath requests and to distribute RMSA schemes, while the SDN agents drive the actual device configurations according to the received commands. The operation principle of DeepRMSA is designed based on the framework of DRL. Upon receiving a lightpath request R_t (step 1), the SDN controller retrieves from the traffic engineering database key network state representations, including the in-service lightpaths, resource utilization and topology abstraction, and invokes the feature engineering module to generate tailored state data s_t for DeepRMSA (step 2). The DNNs of DeepRMSA read the state data and output an RMSA policy π_t for the SDN controller (step 3). The controller in turn takes an action a_t (i.e., determining an RMSA scheme) based on π_t and attempts to set up the corresponding lightpath (step 4). The reward system receives the outcome of the previous RMSA operation as feedback (step 5) and produces an immediate reward r_t for DeepRMSA. r_t, together with s_t and a_t, is stored in an experience buffer (step 6), from which DeepRMSA derives training signals for updating the DNNs afterward (step 7).

Fig. 1. Schematic of DeepRMSA.

The objective of DeepRMSA upon servicing R_t is to maximize the long-term cumulative reward defined as

\Gamma_t = \sum_{t' \in [t, \infty)} \gamma^{t'-t} \cdot r_{t'},   (2)

where γ ∈ [0, 1] is the discount factor that decays future rewards. Eventually, DeepRMSA enables a self-learning capability that can learn and adapt RMSA policies through dynamic lightpath provisioning. Note that, by deploying multiple parallel DRL agents, each for a particular application or functionality (e.g., protection [20] and defragmentation [12]), we can extend DeepRMSA to build an intact autonomic EON system.

IV. DEEPRMSA DESIGN

In this section, we first present the modeling of DeepRMSA, including the definitions of the state representation, the action space, and the reward. Then, we take into account the unique characteristics of dynamic lightpath provisioning and develop two training mechanisms for DeepRMSA.

A. Modeling

1) State: The state representation s_t for DeepRMSA is a 1 × (2|V| + 1 + (2J + 3)K) array containing the information of R_t and the spectrum utilization on the K candidate paths for R_t. We use 2|V| + 1 elements of s_t to convey o, d (in the one-hot format) and τ, where |V| represents the number of nodes in V. For each of the K paths, we calculate the sizes and the starting indices of the first J available FS-blocks, the required number of FS's based on the applicable modulation format, the average size of the available FS-blocks, and the total number of available FS's. Hence, we aim to extract key features on the different candidate paths, from which DeepRMSA can sense the global EON state. Note that a more comprehensive design could include the original two-dimensional spectrum state F in s_t directly to avoid any information loss. However, this would dramatically increase the scale of s_t (i.e., requiring f_0 · |E| elements simply for conveying F) and cause scalability issues. Moreover, making DeepRMSA extract useful features from such a large-scale binary matrix while also incorporating the topology connectivity and the spectrum continuity and contiguity constraints of EON is not trivial. We will keep this as one of our future research tasks.
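For concreteness, the sketch below assembles a state vector with the layout described above (one-hot source and destination, holding time, and 2J + 3 features per candidate path). The dictionary-based per-path feature interface and the example values are illustrative assumptions; the paper additionally normalizes every field before feeding it to the DNNs (Section V-A).

import numpy as np

def build_state(src, dst, holding_time, path_features, num_nodes, J=1):
    """Assemble the 1 x (2|V| + 1 + (2J+3)K) state vector of Section IV-A.

    path_features: list of K dicts, each holding the sizes and starting indices
    of the first J available FS-blocks, the required number of FS's, the average
    free-block size and the total number of free FS's on that candidate path."""
    onehot_src = np.zeros(num_nodes); onehot_src[src] = 1.0
    onehot_dst = np.zeros(num_nodes); onehot_dst[dst] = 1.0
    per_path = []
    for p in path_features:
        per_path += list(p["block_sizes"][:J])    # sizes of the first J free FS-blocks
        per_path += list(p["block_starts"][:J])   # starting indices of those blocks
        per_path += [p["required_fs"], p["avg_block_size"], p["total_free_fs"]]
    state = np.concatenate([onehot_src, onehot_dst, [holding_time], per_path])
    return state.reshape(1, -1)   # shape: (1, 2|V| + 1 + (2J+3)K)

# Example with |V| = 14 (NSFNET), K = 2 candidate paths, J = 1:
features = [{"block_sizes": [10], "block_starts": [0], "required_fs": 4,
             "avg_block_size": 8.0, "total_free_fs": 60} for _ in range(2)]
print(build_state(0, 5, 15.0, features, num_nodes=14).shape)  # (1, 39)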
2) Action: DeepRMSA selects for each R_t a routing path from the K candidates and one of the J FS-blocks on the selected path. Therefore, the action space (denoted as A) includes K · J actions.

3) Reward: DeepRMSA receives an immediate reward r_t of 1 if R_t is successfully serviced. Otherwise, r_t = −1.
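Because the action space is the Cartesian product of K candidate paths and J FS-blocks per path, a flat action index can be decoded with simple integer arithmetic, e.g., as in the sketch below. The row-major ordering over paths and then FS-blocks is an assumption; the paper does not specify how the K · J actions are enumerated.

def decode_action(action_index, K=5, J=1):
    # Map a flat action index in [0, K*J) to (candidate-path index, FS-block index),
    # assuming row-major ordering over paths and then FS-blocks.
    assert 0 <= action_index < K * J
    path_idx, block_idx = divmod(action_index, J)
    return path_idx, block_idx

# With K = 5 and J = 1 (the setting of Section V-A), the five actions simply
# select one of the five candidate paths, each with its first available FS-block.
print([decode_action(a) for a in range(5)])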
4) DNNs: DeepRMSA employs a policy DNN f_{θ_p}(s_t) for generating the RMSA policy (i.e., the probability distribution over the action space) and a value DNN f_{θ_v}(s_t) for estimating the value of s_t (i.e., the discounted cumulative reward since s_t), where θ_p and θ_v are the sets of parameters of the DNNs. f_{θ_p}(s_t) and f_{θ_v}(s_t) share the same fully-connected DNN architecture [15] except for the output layers. The output layer of f_{θ_p}(s_t) consists of K · J neurons, while f_{θ_v}(s_t) has only one output neuron.
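A minimal sketch of the two fully-connected networks is given below, kept to NumPy so that it stays self-contained. The five hidden layers of 128 neurons and the ELU activation follow Sections IV-A and V; the input dimension, the Gaussian weight initialization and the softmax on the policy output are illustrative choices of ours (the paper only states that the policy DNN outputs a probability distribution).

import numpy as np

def elu(x, a=1.0):
    # ELU activation used for the hidden layers (Section V-A).
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def init_mlp(layer_sizes, rng):
    # Gaussian initialization; an arbitrary illustrative choice.
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = elu(x)
    return x

rng = np.random.default_rng(0)
state_dim, K, J = 54, 5, 1   # 2|V| + 1 + (2J+3)K with |V| = 14, K = 5, J = 1
policy_params = init_mlp([state_dim] + [128] * 5 + [K * J], rng)  # K*J output neurons
value_params = init_mlp([state_dim] + [128] * 5 + [1], rng)       # single output neuron

s_t = rng.normal(size=(1, state_dim))
pi_t = softmax(forward(policy_params, s_t).ravel())   # RMSA policy f_{theta_p}(s_t)
v_t = forward(value_params, s_t).item()               # value estimate f_{theta_v}(s_t)
print(pi_t.shape, float(np.sum(pi_t)), v_t)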

B. Training

We designed the training of DeepRMSA based on the framework of the A3C algorithm [14]. Basically, A3C makes use of multiple parallel actor-learners (child threads of a DRL agent), each interacting with its own copy of the system environment, to achieve learning with more abundant and diversified samples. The actor-learners maintain a set of global DNN parameters θ_p* and θ_v* asynchronously.

Different from general DRL tasks that can be modeled as Markov decision processes (i.e., the state transition from s_t to s_{t+1} follows a probability distribution given by P(s_{t+1}|s_t, a_t)), DeepRMSA involves state transitions that are difficult to model. In particular, because R_{t+1} can be random, there can be infinitely many possible states for s_{t+1} in DeepRMSA. Thus, we first slightly modified the standard A3C algorithm by defining an episode as the servicing of N lightpath requests and by making N equal to the training batch size. Here, an episode defines the optimization scope of a DRL task. This way, we eliminate the need for estimating the value of s_{t+1}. We denote DeepRMSA with the episode-based training mechanism as DeepRMSA-EP. Algorithm 1 summarizes the procedures of an actor-learner thread in DeepRMSA-EP. In line 1, the actor-learner initiates an empty experience buffer Λ. Then, for each R_t, the algorithm checks whether Λ is empty (i.e., a new episode starts), and if true, synchronizes the local DNNs with the sets of global parameters (lines 3-5). Line 6 updates the EON state by releasing the resources allocated to the lightpaths that expire. In line 7, we obtain s_t based on the model discussed in Section IV-A. In line 8, we invoke the policy and value DNNs to generate an RMSA policy and a value estimation for s_t. Note that, in DeepRMSA-EP, we make s_t include one more element to indicate the position of R_t within the current episode. For instance, if R_t is the i-th request of the episode, we calculate a position indicator as (N − i + 1)/N. The algorithm decides an RMSA scheme based on the generated policy (lines 9-10, i.e., with the roulette strategy) and receives a reward accordingly (line 11). The RMSA sample is then stored in the buffer (line 12). With lines 13-21, DeepRMSA-EP performs training every time the buffer contains N samples. Specifically, in the for-loop of lines 14-16, the algorithm first calculates for each sample χ_{t'} in the buffer the discounted cumulative reward (starting from R_{t'} till the end of the episode) as

\Gamma_{t'} = \sum_{i \in [0, N-1],\, \chi_{t'+i} \in \Lambda} \gamma^{i} \cdot r_{t'+i}.   (3)

Then, the advantage of each action being taken can be obtained by

\delta_{t'} = \Gamma_{t'} - f_{\theta_v}(s_{t'}),   (4)

which indicates how much better an action turns out to be than estimated. Lines 17-18 calculate the policy and value losses L_{θ_p} and L_{θ_v}, from which the policy and value gradients can be derived. In particular, L_{θ_p} is defined as

L_{\theta_p} = -\frac{1}{N} \sum_{\chi_{t'} \in \Lambda} \delta_{t'} \log f_{\theta_p}(s_{t'}, a_{t'}) - \frac{\alpha}{N} \sum_{\chi_{t'} \in \Lambda} \sum_{a \in A} f_{\theta_p}(s_{t'}, a) \log f_{\theta_p}(s_{t'}, a),   (5)

where α (0 < α ≪ 1) is a weighting coefficient. The rationale behind Eq. 5 is to reinforce actions (i.e., improve their probabilities) with larger advantages while encouraging exploration (by introducing the total entropy of the policies as a secondary penalty term). The definition of the value loss is straightforward, namely, the mean squared error of the value estimations, i.e.,

L_{\theta_v} = \frac{1}{N} \sum_{\chi_{t'} \in \Lambda} \left( f_{\theta_v}(s_{t'}) - \Gamma_{t'} \right)^{2}.   (6)

Finally, in lines 19-20, the actor-learner applies the gradients to tune the global DNN parameters with training algorithms such as RMSProp or Adam [21], and empties the buffer to get prepared for the next episode.

Algorithm 1: Procedures of an actor-learner thread in DeepRMSA-EP
 1: initiate Λ = ∅;
 2: for each R_t do
 3:   if Λ == ∅ then
 4:     set θ_p = θ_p*, θ_v = θ_v*;
 5:   end
 6:   release the spectra occupied by expired requests;
 7:   obtain s_t with R_t and G(V, E, F);
 8:   calculate f_{θ_p}(s_t), f_{θ_v}(s_t);
 9:   calculate the cumulative sum of f_{θ_p}(s_t) as ζ;
10:   decide an RMSA scheme a_t = arg min_a {ζ(a) ≥ rand()};
11:   attempt to service R_t with a_t and receive a reward r_t;
12:   store (s_t, a_t, f_{θ_v}(s_t), r_t) in Λ;
13:   if |Λ| == N then
14:     for each χ_{t'} = (s_{t'}, a_{t'}, f_{θ_v}(s_{t'}), r_{t'}) in Λ do
15:       calculate Γ_{t'} and δ_{t'} with Eqs. 3 and 4;
16:     end
17:     calculate L_{θ_p} and L_{θ_v} with Eqs. 5 and 6;
18:     obtain the policy and value gradients with L_{θ_p}, L_{θ_v};
19:     apply the gradients to update θ_p* and θ_v*;
20:     empty Λ;
21:   end
22: end
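To make the update step concrete, the sketch below computes the discounted returns of Eq. (3), the advantages of Eq. (4) and the losses of Eqs. (5) and (6) for a buffer of N samples with NumPy. It only evaluates the losses; the gradient computation and the asynchronous update of the global parameters (lines 18-19 of Algorithm 1) are omitted, and the stored policy vectors and value estimates are assumed to come from the DNNs of Section IV-A.

import numpy as np

def episode_losses(rewards, values, policies, actions, gamma=0.95, alpha=0.01):
    """rewards, values: length-N arrays; policies: (N, K*J) action distributions
    from f_theta_p; actions: indices of the actions actually taken.
    Returns the losses (L_theta_p, L_theta_v) of Eqs. (5) and (6)."""
    N = len(rewards)
    # Eq. (3): discounted cumulative reward from each sample to the end of the episode.
    returns = np.zeros(N)
    g = 0.0
    for i in reversed(range(N)):
        g = rewards[i] + gamma * g
        returns[i] = g
    # Eq. (4): advantage of each taken action over the value estimate.
    advantages = returns - values
    # Eq. (5): -(1/N) sum delta*log f(s, a_taken) - (alpha/N) sum_a f(s, a) log f(s, a).
    log_p_taken = np.log(policies[np.arange(N), actions] + 1e-12)
    plogp = np.sum(policies * np.log(policies + 1e-12), axis=1)
    policy_loss = -np.mean(advantages * log_p_taken) - alpha * np.mean(plogp)
    # Eq. (6): mean squared error of the value estimations.
    value_loss = np.mean((values - returns) ** 2)
    return policy_loss, value_loss

# Toy example with N = 4 samples and K*J = 5 actions.
rng = np.random.default_rng(1)
policies = rng.dirichlet(np.ones(5), size=4)
print(episode_losses(rewards=np.array([1.0, -1.0, 1.0, 1.0]),
                     values=rng.normal(size=4),
                     policies=policies,
                     actions=np.array([0, 2, 1, 4])))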
Note that the uncertainty of dynamic lightpath requests can result in unpredictable trajectories of s_t, which in turn can cause oscillations of the cumulative rewards and destabilize the training process. This problem becomes especially severe when the numbers of requests involved are small. Recalling the calculation of cumulative rewards in Eq. 3, Γ_{t'} decreases as χ_{t'} gets closer to the end of the buffer, until it eventually contains the reward from only one request. To cope with this issue, we propose a window-based flexible training mechanism for DeepRMSA, namely DeepRMSA-FLX. Basically, DeepRMSA-FLX invokes the training process each time the buffer contains 2N − 1 samples. DeepRMSA-FLX slides a window of length N through the buffer and calculates the cumulative reward for each of the first N samples, still with Eq. 3. Thus, every cumulative reward involves the rewards from servicing N requests. By doing so, we aim to smooth out the oscillations equally for all the samples (if N is sufficiently large¹). Then, the algorithm calculates the policy and value losses with these N samples and updates the global DNN parameters accordingly. The N samples are removed from the buffer afterward. Meanwhile, the condition for synchronizing the local DNNs (line 3 of Algorithm 1) becomes |Λ| being equal to N − 1 in DeepRMSA-FLX.

¹Note that we typically set N to moderate values, e.g., 50, to allow the training signals to be applied to the DNNs quickly.
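The bookkeeping of DeepRMSA-FLX can be sketched as follows: training is triggered once the buffer holds 2N − 1 samples, each of the first N samples receives a full N-step discounted return via Eq. (3), and those N samples are then removed from the buffer. The list-of-dicts buffer representation is an illustrative choice.

import numpy as np

def flx_training_step(buffer, N, gamma=0.95):
    """DeepRMSA-FLX: once the buffer holds 2N-1 samples, compute an N-step
    discounted return (Eq. 3) for each of the first N samples, hand them to the
    learner, and drop them from the buffer. Returns (training_batch, remaining_buffer)."""
    if len(buffer) < 2 * N - 1:
        return None, buffer            # not enough samples yet
    rewards = np.array([sample["r"] for sample in buffer])
    discounts = gamma ** np.arange(N)
    batch = []
    for i in range(N):                 # slide a window of length N over the buffer
        g = float(np.dot(discounts, rewards[i:i + N]))
        batch.append({**buffer[i], "return": g})
    return batch, buffer[N:]           # the first N samples are removed afterward

# Toy usage with N = 3: the buffer must reach 2N - 1 = 5 samples before training.
buf = [{"s": None, "a": 0, "r": r} for r in [1, -1, 1, 1, -1]]
batch, buf = flx_training_step(buf, N=3)
print([round(x["return"], 3) for x in batch], len(buf))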

V. EVALUATION

A. Simulation Setup

We evaluated the performance of DeepRMSA with numerical simulations. We first used the 14-node NSFNET topology in Fig. 2 and assumed that each fiber link could accommodate 100 FS's. The dynamic lightpath requests were generated according to a Poisson process following a uniform traffic distribution, with the average arrival rate and service duration being 10 and 15 time units, respectively. The bandwidth requirement of each request is evenly distributed between 25 and 100 Gb/s. The DNNs used ELU as the activation function for the hidden layers. We set K = 5 and J = 1; hence, DeepRMSA selected only the routing paths and applied the first-fit scheme for spectrum allocation. γ, α, N and the learning rate were set as 0.95, 0.01, 50 and 10^-5, respectively. We used the Adam algorithm [21] for training. Note that we normalized every field of s_t before feeding it to the DNNs.

Fig. 2. 14-node NSFNET topology (link length in kilometers).
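For reference, the sketch below generates the dynamic traffic described above: Poisson arrivals with rate 10, mean holding time 15, uniformly chosen (distinct) node pairs and bandwidths uniform in [25, 100] Gb/s. Modeling the holding times as exponentially distributed is our assumption; the paper only states the average service duration.

import random

def generate_requests(num_requests, num_nodes=14, arrival_rate=10.0,
                      mean_holding=15.0, seed=0):
    """Sample dynamic lightpath requests R_t(o, d, b, tau): Poisson arrivals
    (rate 10), mean holding time 15 (assumed exponential), uniform node pairs,
    bandwidth uniform in [25, 100] Gb/s."""
    rng = random.Random(seed)
    t, requests = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(arrival_rate)           # inter-arrival time
        o, d = rng.sample(range(num_nodes), 2)       # distinct source/destination
        b = rng.uniform(25.0, 100.0)                 # bandwidth requirement (Gb/s)
        tau = rng.expovariate(1.0 / mean_holding)    # service duration
        requests.append({"arrival": t, "o": o, "d": d, "b": b, "tau": tau})
    return requests

print(generate_requests(3))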
B. Numerical Results

We first assessed the impact of the scale of the DNNs on the performance of DeepRMSA. We fixed the number of actor-learners at 16 and implemented DNNs of three setups for both DeepRMSA-EP and DeepRMSA-FLX, i.e., 3 hidden layers of 64 neurons (3 × 64), 5 hidden layers of 128 neurons (5 × 128), and 8 hidden layers of 256 neurons (8 × 256). Figs. 3(a) and (c) show the evolutions of the cumulative rewards (collected from every 1000 requests) with the different DNN setups during training. We can see that for both algorithms, DNNs with larger scales facilitate faster training. On average, it takes DeepRMSA 15,000 and 5,000 training epochs to converge with DNNs of 3 × 64 and 5 × 128 (or 8 × 256), respectively. Eventually, the rewards associated with the three setups are very close, with 5 × 128 performing slightly better. This is because 5 × 128 enables a better data representation ability than 3 × 64, and in the meantime does not suffer from the overfitting issue encountered by 8 × 256. Then, we evaluated the impact of the number of actor-learners by fixing the size of the DNNs at 5 × 128 and implementing DeepRMSA with 1, 8 and 16 actor-learners. Figs. 3(b) and (d) show the corresponding evolutions of the cumulative rewards. Again, we can draw the same observations for both algorithms, i.e., increasing the number of actor-learners leads to faster convergence and slightly higher rewards. In particular, increasing the number of actor-learners from 1 to 8 can accelerate the training speed by a factor of nearly 10, as multiple parallel actor-learners enable more diversified explorations of the problem. Since the performance gain from further increasing the number of actor-learners is marginal, we expect DeepRMSA with 16 actor-learners to achieve the best performance. Hence, we fixed the scale of the DNNs and the number of actor-learners at 5 × 128 and 16, respectively, for the later evaluations.

Fig. 3. Cumulative rewards from DeepRMSA-FLX and DeepRMSA-EP with different (a), (c): DNN sizes, and (b), (d): numbers of actor-learners.

Next, we compared the performance of DeepRMSA-EP and DeepRMSA-FLX with that of the baseline algorithms, i.e., SP-FF and KSP-FF. KSP-FF has been shown to achieve the state-of-the-art performance among the existing heuristic designs [10]. Fig. 4 plots the evolution of the request blocking probability for the algorithms. We can see that DeepRMSA-EP and DeepRMSA-FLX perform similarly at the beginning and outperform SP-FF after a training period of only 1,000 epochs. However, DeepRMSA-FLX successfully beats KSP-FF after a training period of 37,500 epochs, whereas the performance of DeepRMSA-EP eventually merely fluctuates around that of KSP-FF. After training of 150,000 epochs, DeepRMSA-FLX can achieve a blocking reduction of 20.3% compared with KSP-FF. To reveal the rationale behind the behaviors of DeepRMSA-EP and DeepRMSA-FLX, Figs. 5(a) and (b) present the results of the normalized value loss and the entropy of policy during training, respectively. It can be seen that the proposed window-based training mechanism facilitates more accurate value estimations (lower value losses) and stabilized training, while the training of DeepRMSA-EP starts to diverge after 10,000 epochs. Note that training periods of thousands of epochs are too costly for practical network operations. A more efficient way of training DeepRMSA is expected to be to perform offline training with an RMSA simulator first, before enrolling it in online lightpath provisioning for fine tuning [18].

Fig. 4. Request blocking probability.

Fig. 5. (a) Normalized value loss, and (b) entropy of policy during training.

To verify the robustness of DeepRMSA, we also performed simulations with the 11-node COST239 topology in Fig. 6(a). We set the average request arrival rate and service duration as 20 and 30 time units, respectively. All the rest of the parameters remained the same as those for the evaluations with the NSFNET topology. Fig. 6(b) shows the results of the request blocking probability with the COST239 topology, which demonstrate a clear performance difference between DeepRMSA-EP and DeepRMSA-FLX. Eventually, DeepRMSA-FLX can achieve a blocking probability that is 14.3% and 18.9% lower than those of KSP-FF and DeepRMSA-EP, respectively.

Fig. 6. (a) 11-node COST239 topology (link length in kilometers), and (b) request blocking probability with the COST239 topology.

VI. CONCLUSION

In this paper, we proposed DeepRMSA, a DRL-based RMSA framework for learning the optimal online RMSA policies in EONs. DeepRMSA parameterizes RMSA policies with DNNs and trains the DNNs progressively with experiences from dynamic lightpath provisioning. By taking into account the unique characteristics of the RMSA problem, we developed two training mechanisms for DeepRMSA based on the framework of A3C. Simulation results show that the proposed training mechanisms facilitate successful training of DeepRMSA, which can achieve blocking reductions of more than 20.3% and 14.3% in the NSFNET and COST239 topologies, respectively, when compared with the baselines. An interesting future research topic would be partitioned DeepRMSA or hierarchical DeepRMSA, where multiple DeepRMSA agents cooperate hierarchically (within the same autonomous system) or interact peer-to-peer through brokers (in a multi-domain EON scenario [22]) to achieve scalability of DeepRMSA applied to topologies with larger scales. Meanwhile, multi-agent DeepRMSA applied to multiple autonomous-system networks will introduce game-theoretic approaches similar to the discussions in [23], [24], thus yielding more interesting yet practical multi-agent competitive/cooperative learning problems.

ACKNOWLEDGMENTS

This work was supported in part by DOE DE-SC0016700 and NSF ICE-T:RC 1836921.

REFERENCES

[1] O. Gerstel, M. Jinno, A. Lord, and S. J. B. Yoo, “Elastic optical networking: a new dawn for the optical layer?” IEEE Commun. Mag., vol. 50, pp. S12–S20, Apr. 2012.
[2] M. Jinno, B. Kozicki, H. Takara, A. Watanabe, Y. Sone, T. Tanaka, and
A. Hirano, “Distance-adaptive spectrum resource allocation in spectrum-
sliced elastic optical path network,” IEEE Commun. Mag., vol. 48, no. 8,
pp. 138–145, Aug. 2010.
[3] B. Chatterjee, N. Sarma, and E. Oki, “Routing and spectrum allocation
in elastic optical networks: A tutorial,” IEEE Commun. Surveys Tuts.,
vol. 17, no. 3, pp. 1776–1800, thirdquarter 2015.
[4] Y. Wang, X. Cao, and Y. Pan, “A study of the routing and spectrum
allocation in spectrum-sliced elastic optical path networks,” in Proc. of
INFOCOM, April 2011, pp. 1503–1511.
[5] K. Christodoulopoulos, I. Tomkos, and E. Varvarigos, “Elastic band-
width allocation in flexible OFDM-based optical networks,” J. Lightw.
Technol., vol. 29, pp. 1354–1366, May 2011.
[6] M. Klinkowski, M. Ruiz, L. Velasco, D. Careglio, V. Lopez, and
J. Comellas, “Elastic spectrum allocation for time-varying traffic in
flexgrid optical networks,” J. Sel. Areas Commun., vol. 31, no. 1, pp.
26–38, January 2013.
[7] L. Gong, X. Zhou, W. Lu, and Z. Zhu, “A two-population based
evolutionary approach for optimizing routing, modulation and spectrum
assignments (RMSA) in O-OFDM networks,” IEEE Commun. Lett.,
vol. 16, pp. 1520–1523, Sept. 2012.
[8] H. Wu, F. Zhou, Z. Zhu, and Y. Chen, “On the distance spectrum
assignment in elastic optical networks,” IEEE/ACM Trans. Netw., vol. 25,
no. 4, pp. 2391–2404, Aug 2017.
[9] Z. Zhu, W. Lu, L. Zhang, and N. Ansari, “Dynamic service provisioning
in elastic optical networks with hybrid single-/multi-path routing,” J.
Lightw. Technol., vol. 31, no. 1, pp. 15–22, Jan 2013.
[10] Y. Yin, H. Zhang, M. Zhang, M. Xia, Z. Zhu, S. Dahlfort, and S. J. B.
Yoo, “Spectral and spatial 2D fragmentation-aware routing and spectrum
assignment algorithms in elastic optical networks,” J. Opt. Commun.
Netw., vol. 5, no. 10, pp. A100–A106, Oct 2013.
[11] F. Cugini, F. Paolucci, G. Meloni, G. Berrettini, M. Secondini, F. Fresi,
N. Sambo, L. Poti, and P. Castoldi, “Push-pull defragmentation without
traffic disruption in flexible grid optical networks,” J. Lightw. Technol.,
vol. 31, pp. 125–133, Jan. 2013.
[12] M. Zhang, C. You, and Z. Zhu, “On the parallelization of spectrum de-
fragmentation reconfigurations in elastic optical networks,” IEEE/ACM
Trans. Netw., vol. 24, no. 5, pp. 2819–2833, October 2016.
[13] V. Mnih and K. Kavukcuoglu et al., “Human-level control through deep
reinforcement learning,” Nature, vol. 518, pp. 529–533, 2015.
[14] V. Mnih, A. Badia, M. Mirza, A. Graves, T. Harley, T. Lillicrap,
D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep
reinforcement learning,” in Proc. of Int. Conf. Mach. Learn., 2016, pp.
1928–1937.
[15] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, pp. 436–444, 2015.
[16] N. Luong, D. Hoang, S. Gong, D. Niyato, P. Wang, Y. Liang, and
D. Kim, “Applications of deep reinforcement learning in communi-
cations and networking: A survey,” arXiv preprint arXiv:1810.07862,
2018.
[17] Z. Xu, J. Tang, J. Meng, W. Zhang, Y. Wang, C. Liu, and D. Yang,
“Experience-driven networking: A deep reinforcement learning based
approach,” arXiv preprint arXiv:1801.05757, 2018.
[18] S. Salman, C. Streiffer, H. Chen, T. Benson, and A. Kadav, “DeepConf:
Automating data center network topologies management with machine
learning,” in Proc. of NetAI, 2018, pp. 8–14.
[19] B. Kozicki, H. Takara, Y. Sone, A. Watanabe, and M. Jinno, “Distance-
adaptive spectrum allocation in elastic optical path network (SLICE)
with bit per symbol adjustment,” in Proc. of OFC, 2010, pp. 1–3.
[20] X. Chen, M. Tornatore, S. Zhu, F. Ji, W. Zhou, C. Chen, D. Hu, L. Jiang, and Z. Zhu, “Flexible availability-aware differentiated protection in software-defined elastic optical networks,” J. Lightw. Technol., vol. 33, pp. 3872–3882, Sept. 2015.
[21] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014.
[22] S. J. B. Yoo, “Multi-domain cognitive optical software defined networks with market-driven brokers,” in Proc. of ECOC, Sept. 2014, pp. 1–3.
[23] X. Chen, Z. Zhu, L. Sun, J. Yin, S. Zhu, A. Castro, and S. J. B. Yoo, “Incentive-driven bidding strategy for brokers to compete for service provisioning tasks in multi-domain SD-EONs,” J. Lightw. Technol., vol. 34, no. 16, pp. 3867–3876, 2016.
[24] X. Chen, Z. Zhu, J. Guo, S. Kang, R. Proietti, A. Castro, and S. J. B. Yoo, “Leveraging mixed-strategy gaming to realize incentive-driven VNF service chain provisioning in broker-based elastic optical inter-datacenter networks,” J. Opt. Commun. Netw., vol. 10, no. 2, pp. 1–9, 2018.
