DeepRMSA
Abstract—This paper proposes DeepRMSA, a deep reinforcement learning framework for routing, modulation and spectrum assignment (RMSA) in elastic optical networks (EONs). DeepRMSA learns the correct online RMSA policies by parameterizing the policies with deep neural networks (DNNs) that can sense complex EON states. The DNNs are trained with experiences of dynamic lightpath provisioning. We first modify the asynchronous advantage actor-critic (A3C) algorithm and present an episode-based training mechanism for DeepRMSA, namely, DeepRMSA-EP. DeepRMSA-EP divides the dynamic provisioning process into multiple episodes (each containing the servicing of a fixed number of lightpath requests) and performs training at the end of each episode. The optimization target of DeepRMSA-EP at each step of servicing a request is to maximize the cumulative reward within the rest of the episode. Thus, we obviate the need for estimating the rewards related to unknown future states. To overcome the instability issue in the training of DeepRMSA-EP due to the oscillations of cumulative rewards, we further propose a window-based flexible training mechanism, i.e., DeepRMSA-FLX. DeepRMSA-FLX attempts to smooth out the oscillations by defining the optimization scope at each step as a sliding window and ensuring that the cumulative rewards always include rewards from a fixed number of requests. Evaluations with two sample topologies show that DeepRMSA-FLX can effectively stabilize the training while achieving blocking probability reductions of more than 20.3% and 14.3% compared with the baselines.

Index Terms—Elastic optical networks (EONs), routing, modulation and spectrum assignment (RMSA), deep reinforcement learning, asynchronous advantage actor-critic algorithm.

I. INTRODUCTION

EON can flexibly set up bandwidth-variable superchannels by grooming a series of finer-granularity (e.g., 6.25 GHz) subcarriers and adapting the modulation formats according to the QoT of the lightpaths [2].

The flexible resource allocation mechanisms in EON, on the other hand, make the corresponding service provisioning designs more complicated. To fully exploit the benefits of such flexibilities and realize cost-effective EON, previous studies have intensively investigated the routing, modulation and spectrum assignment (RMSA) problem for EON [3]. The authors of [4]–[6] first proposed integer linear programming (ILP) models for solving the static RMSA problem, where all the lightpath requests are assumed to be known a priori. While the ILP models can provide the optimal solutions to the RMSA problem, they are proved to be NP-hard [4] and are intractable for large-scale instances. In this context, a number of heuristic or approximation algorithms have been developed. In [4], Wang et al. proposed two algorithms, namely, balanced load spectrum allocation and shortest path with maximum spectrum reuse, to minimize the maximum required spectrum resources in an EON for a given traffic demand. The authors of [5] presented a simulated annealing approach for determining the servicing order of lightpath requests and applied the k-shortest-path routing and first-fit (KSP-FF) scheme to calculate the RMSA solution for each request afterward. In [6], [7], the authors investigated leveraging genetic algorithms to realize joint RMSA optimization. A conflict-graph-based two-phase algorithm with a proven performance guarantee was proposed in [8].
For more heuristic designs for dynamic RMSA, readers can refer to [3], [9]. The authors of [10] investigated the spectrum fragmentation effect in dynamic lightpath provisioning and proposed a fragmentation-aware RMSA algorithm to mitigate spectrum fragmentation. More aggressive service reconfiguration approaches, e.g., spectrum defragmentation [11], [12], have also been proposed as complements to normal RMSA algorithms to enable periodical service consolidations, but at the expense of high operational costs. However, the existing works either apply fixed RMSA policies regardless of the time-varying EON states or rely on simple empirical policies based on manually extracted features, i.e., they lack a comprehensive perception of the holistic EON state, and therefore are unable to achieve truly adaptive service provisioning in EONs.

In the meantime, recent advances in deep reinforcement learning (DRL) have demonstrated beyond-human-level performance in handling large-scale online control tasks [13], [14]. By parameterizing policies with deep neural networks (DNNs) [15], DRL enables learning agents to perceive complex system states from high-dimensional input data (e.g., screenshots and traffic matrices) and to progressively learn correct policies through experiences of repeated interactions with the target systems. The application of DRL in the communication and networking domain has received intensive research interest during the past two years [16]–[18]. In [17], the authors enhanced the general deep Q-learning framework of [13] with novel exploration and experience replay techniques to solve the traffic engineering problem. The authors of [18] presented a DRL-based framework for datacenter network management and demonstrated a DRL agent that can learn the optimal topology configurations with respect to different application profiles. Nevertheless, the application of DRL in optical networking, or in particular, for addressing the RMSA problem, has not been investigated.

In this paper, we propose DeepRMSA, a DRL-based RMSA framework for learning the optimal online RMSA policies in EONs. The contributions of this paper can be summarized as follows. 1) We propose, for the first time, a DRL framework for optical network management and resource allocation, i.e., RMSA. 2) We propose two training mechanisms for DeepRMSA, taking into account the unique characteristics of the RMSA problem. 3) Numerical results verify the superiority of DeepRMSA over state-of-the-art heuristic algorithms.

The rest of the paper is organized as follows. Section II presents the RMSA problem formulation. Section III details the DeepRMSA framework. In Section IV, we elaborate on the design of DeepRMSA, including the modeling and the training mechanisms. Then, in Section V, we show the performance evaluations and related discussions. Finally, Section VI concludes the paper.

II. PROBLEM FORMULATION

Let G(V, E, F) denote an EON topology, where V and E represent the sets of nodes and fiber links, and F = {F_{e,f} | ∀e, f} contains the state of each frequency slot (FS) f ∈ [1, f_0] on each fiber link e ∈ E. We model a lightpath request from node o to d (o, d ∈ V) as R_t(o, d, b, τ), with b Gb/s and τ denoting the bandwidth requirement and the service duration, respectively.

To provision R_t, we need to compute an end-to-end routing path P_{o,d}, determine a proper modulation format m for QoT assurance, and allocate a number of spectrally contiguous FS's (i.e., the spectrum contiguity constraint) on each link along P_{o,d} according to b and m. In this work, we assume that the EON is not equipped with spectrum conversion capability. Therefore, the spectra allocated on different fibers to R_t must align (i.e., the spectrum continuity constraint). We adopt the impairment-aware model in [19] to decide the modulation format according to the physical distance of P_{o,d}. Specifically, the number of FS's needed can be computed as

n = \lceil b / (m \cdot C_{grid}^{BPSK}) \rceil,   (1)

where C_{grid}^{BPSK} is the data rate that an FS carrying a BPSK signal can support, and m ∈ {1, 2, 3, 4} corresponds to BPSK, QPSK, 8-QAM and 16-QAM, respectively. The static RMSA problem (i.e., offline network planning) provides a set of permanent lightpath requests R = {R_t | ∀t} (τ → ∞) and requires provisioning all of them in a batch subject to the link capacity constraint [4]. The objective of the static RMSA problem is to minimize the total spectrum usage. Unlike the static problem, where the requests are known a priori, in the dynamic RMSA problem (i.e., online lightpath provisioning) considered in this work, lightpath requests arrive and expire on-the-fly and need to be serviced immediately upon arrival. The dynamic RMSA problem aims at minimizing the long-term request blocking probability, which is defined as the ratio of the number of blocked requests to the total number of requests over a period.
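To make Eq. (1) concrete, the sketch below computes n for a request given the path length, assuming a BPSK per-FS capacity of 12.5 Gb/s and illustrative transmission-reach thresholds for selecting m; both values are placeholders rather than the exact parameters of the impairment-aware model in [19].

```python
import math

# Assumed per-FS data rate for BPSK (Gb/s); a placeholder, not the exact value used in the paper.
C_GRID_BPSK = 12.5

# Hypothetical transmission-reach thresholds (km) for m = 4, 3, 2, 1
# (16-QAM, 8-QAM, QPSK, BPSK); illustrative only, not the model of [19].
REACH_KM = [(625, 4), (1250, 3), (2500, 2), (float("inf"), 1)]

def modulation_level(path_length_km: float) -> int:
    """Pick the highest modulation level m whose assumed reach covers the path."""
    for reach, m in REACH_KM:
        if path_length_km <= reach:
            return m
    return 1

def num_fss(bandwidth_gbps: float, path_length_km: float) -> int:
    """Eq. (1): n = ceil(b / (m * C_grid^BPSK))."""
    m = modulation_level(path_length_km)
    return math.ceil(bandwidth_gbps / (m * C_GRID_BPSK))

# Example: a 100 Gb/s request over a 2000 km path maps to m = 2 (QPSK)
# under the assumed reaches, hence ceil(100 / 25) = 4 FS's.
print(num_fss(100, 2000))
```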
III. DEEPRMSA FRAMEWORK

Fig. 1 shows the schematic of DeepRMSA. DeepRMSA takes advantage of the software-defined networking (SDN) paradigm for centralized and automated control and management of the EON data plane [20]. Specifically, a remote SDN controller interacts with the local SDN agents to collect network states and lightpath requests and to distribute RMSA schemes, while the SDN agents drive the actual device configurations according to the received commands. The operation principle of DeepRMSA is designed based on the framework of DRL. Upon receiving a lightpath request R_t (step 1), the SDN controller retrieves from the traffic engineering database key network state representations, including the in-service lightpaths, resource utilization and topology abstraction, and invokes the feature engineering module to generate tailored state data s_t for DeepRMSA (step 2). The DNNs of DeepRMSA read the state data and output an RMSA policy π_t for the SDN controller (step 3). The controller in turn takes an action a_t (i.e., determines an RMSA scheme) based on π_t and attempts to set up the corresponding lightpath (step 4). The reward system receives the outcome of the previous RMSA operation as feedback (step 5) and produces an immediate reward r_t for DeepRMSA. r_t, together with s_t and a_t, is stored in an experience buffer (step 6), from which DeepRMSA derives training signals for updating the DNNs afterward (step 7).

Fig. 1. Schematic of DeepRMSA.

The objective of DeepRMSA upon servicing R_t is to maximize the long-term cumulative reward defined as

\Gamma_t = \sum_{t' \in [t, \infty)} \gamma^{t'-t} \cdot r_{t'},   (2)

where γ ∈ [0, 1] is the discount factor that decays future rewards. Eventually, DeepRMSA enables a self-learning capability that can learn and adapt RMSA policies through dynamic lightpath provisioning. Note that, by deploying multiple parallel DRL agents, each for a particular application or functionality (e.g., protection [20] and defragmentation [12]), we can extend DeepRMSA to build an intact autonomic EON system.
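The seven-step workflow above amounts to one provisioning iteration of the DRL loop. The skeleton below is a minimal sketch of that iteration; the controller and feature-engineering hooks (get_request, build_state, try_setup_lightpath) are hypothetical placeholders rather than an actual DeepRMSA API, and the +1/-1 reward anticipates the definition given in Section IV-A.

```python
from collections import deque
import numpy as np

experience_buffer = deque()  # step 6: (s_t, a_t, r_t) samples kept for later training

def provisioning_step(controller, policy_dnn, rng=np.random.default_rng()):
    """One DRL-driven provisioning iteration (steps 1-7); all hooks are hypothetical."""
    request = controller.get_request()                # step 1: lightpath request R_t arrives
    s_t = controller.build_state(request)             # step 2: feature engineering -> state s_t
    pi_t = policy_dnn(s_t)                            # step 3: DNNs output RMSA policy pi_t
    a_t = int(rng.choice(len(pi_t), p=pi_t))          # step 4: pick an RMSA scheme from pi_t
    success = controller.try_setup_lightpath(request, a_t)
    r_t = 1.0 if success else -1.0                    # step 5: immediate reward (Section IV-A)
    experience_buffer.append((s_t, a_t, r_t))         # step 6: store the experience
    # step 7: training signals are derived from the buffer later (Section IV-B)
    return r_t
```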
IV. DEEPRMSA DESIGN

In this section, we first present the modeling of DeepRMSA, including the definitions of the state representation, action space, and reward. Then, we take into account the unique characteristics of dynamic lightpath provisioning and develop two training mechanisms for DeepRMSA.

A. Modeling

1) State: The state representation s_t for DeepRMSA is a 1 × (2|V| + 1 + (2J + 3)K) array containing the information of R_t and the spectrum utilization on K candidate paths for R_t. We use 2|V| + 1 elements of s_t to convey o, d (in the one-hot format), and τ, where |V| represents the number of nodes in V. For each of the K paths, we calculate the sizes and the starting indices of the first J available FS-blocks, the required number of FS's based on the applicable modulation format, the average size of the available FS-blocks, and the total number of available FS's. Hence, we aim to extract key features of the different candidate paths, from which DeepRMSA can sense the global EON state. Note that a more comprehensive design could include the original two-dimensional spectrum state F in s_t. However, encoding the topology connectivity and the spectrum continuity and contiguity constraints of the EON in such a representation is not trivial. We will keep this as one of our future research tasks.
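To make the layout of s_t concrete, the following sketch assembles the 1 × (2|V| + 1 + (2J + 3)K) array from precomputed per-path spectrum features. The exact ordering of the fields within the array, the padding for paths with fewer than J available FS-blocks, and the normalization (Section V only states that every field is normalized) are assumptions for illustration, not the reference implementation.

```python
import numpy as np

def build_state(src, dst, holding_time, path_features, num_nodes, J):
    """Assemble s_t = [one-hot src | one-hot dst | tau | per-path features].

    path_features: list of K dicts with keys
      'blocks'      -> list of (start_index, size) of the first J available FS-blocks,
      'required_fs' -> FS's needed under the applicable modulation format (Eq. 1),
      'avg_block'   -> average size of the available FS-blocks,
      'total_fs'    -> total number of available FS's on the path.
    """
    K = len(path_features)
    s = np.zeros(2 * num_nodes + 1 + (2 * J + 3) * K, dtype=np.float32)
    s[src] = 1.0                          # one-hot source
    s[num_nodes + dst] = 1.0              # one-hot destination
    s[2 * num_nodes] = holding_time       # tau (normalized in practice)
    base = 2 * num_nodes + 1
    for k, feat in enumerate(path_features):
        off = base + k * (2 * J + 3)
        blocks = (feat["blocks"] + [(-1, 0)] * J)[:J]   # assumed padding if fewer than J blocks
        for j, (start, size) in enumerate(blocks):
            s[off + 2 * j] = start
            s[off + 2 * j + 1] = size
        s[off + 2 * J] = feat["required_fs"]
        s[off + 2 * J + 1] = feat["avg_block"]
        s[off + 2 * J + 2] = feat["total_fs"]
    return s
```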
2) Action: DeepRMSA selects for each R_t a routing path from the K candidates and one of the J FS-blocks on the selected path. Therefore, the action space (denoted as A) includes K · J actions.

3) Reward: DeepRMSA receives an immediate reward r_t of 1 if R_t is successfully serviced; otherwise, r_t = −1.

4) DNNs: DeepRMSA employs a policy DNN f_{θ_p}(s_t) for generating the RMSA policy (i.e., the probability distribution over the action space) and a value DNN f_{θ_v}(s_t) for estimating the value of s_t (i.e., the discounted cumulative reward from s_t onward), where θ_p and θ_v are the sets of parameters of the DNNs. f_{θ_p}(s_t) and f_{θ_v}(s_t) share the same fully-connected DNN architecture [15] except for the output layers. The output layer of f_{θ_p}(s_t) consists of K · J neurons, while f_{θ_v}(s_t) has only one output neuron.
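One possible realization of the two DNNs is sketched below in PyTorch, with ELU hidden activations as used in Section V: a fully-connected trunk followed by a K·J-way softmax policy head and a single-neuron value head. Whether f_{θ_p} and f_{θ_v} actually share the hidden-layer parameters or are merely identical in shape is not specified here, so the shared trunk (and the illustrative sizes of 3 hidden layers with 128 neurons) is an assumption.

```python
import torch
import torch.nn as nn

class DeepRMSANets(nn.Module):
    """Policy head: probabilities over the K*J RMSA actions; value head: scalar estimate of s_t."""
    def __init__(self, state_dim: int, K: int, J: int, hidden: int = 128, layers: int = 3):
        super().__init__()
        trunk, dim = [], state_dim
        for _ in range(layers):                      # fully-connected hidden layers with ELU
            trunk += [nn.Linear(dim, hidden), nn.ELU()]
            dim = hidden
        self.trunk = nn.Sequential(*trunk)
        self.policy_head = nn.Linear(hidden, K * J)  # K*J output neurons (Section IV-A)
        self.value_head = nn.Linear(hidden, 1)       # single output neuron

    def forward(self, s):
        h = self.trunk(s)
        pi = torch.softmax(self.policy_head(h), dim=-1)   # f_theta_p(s_t)
        v = self.value_head(h).squeeze(-1)                # f_theta_v(s_t)
        return pi, v
```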
B. Training

We designed the training of DeepRMSA based on the framework of the A3C algorithm [14]. Basically, A3C makes use of multiple parallel actor-learners (child threads of a DRL agent), each interacting with its own copy of the system environment, to achieve learning with more abundant and diversified samples. The actor-learners asynchronously update a set of global DNN parameters θ_p* and θ_v*.

Different from general DRL tasks that can be modeled as Markov decision processes (i.e., where the state transition from s_t to s_{t+1} follows a probability distribution given by P(s_{t+1}|s_t, a_t)), DeepRMSA involves state transitions that are difficult to model. In particular, because R_{t+1} can be arbitrary, there are infinitely many possible states for s_{t+1} in DeepRMSA. Thus, we first slightly modified the standard A3C algorithm by defining an episode as the servicing of N lightpath requests and by making N equal to the training batch size. Here, an episode defines the optimization scope of the DRL task. This way, we eliminate the need for estimating the value of s_{t+1}. We denote DeepRMSA with the episode-based training mechanism as DeepRMSA-EP. Algorithm 1 summarizes the procedures of an actor-learner thread in DeepRMSA-EP.

Algorithm 1: Procedures of an actor-learner thread in DeepRMSA-EP
 1: initiate Λ = ∅;
 2: for each R_t do
 3:   if Λ == ∅ then
 4:     set θ_p = θ_p*, θ_v = θ_v*;
 5:   end
 6:   release the spectra occupied by expired requests;
 7:   obtain s_t with R_t and G(V, E, F);
 8:   calculate f_{θ_p}(s_t), f_{θ_v}(s_t);
 9:   calculate the cumulative sum of f_{θ_p}(s_t) as ζ;
10:   decide an RMSA scheme a_t = argmin_a {ζ(a) ≥ rand()};
11:   attempt to service R_t with a_t and receive a reward r_t;
12:   store (s_t, a_t, f_{θ_v}(s_t), r_t) in Λ;
13:   if |Λ| == N then
14:     for each χ_{t'} = (s_{t'}, a_{t'}, f_{θ_v}(s_{t'}), r_{t'}) in Λ do
15:       calculate Γ_{t'} and δ_{t'} with Eqs. 3 and 4;
16:     end
17:     calculate L_{θ_p} and L_{θ_v} with Eqs. 5 and 6;
18:     obtain the policy and value gradients with L_{θ_p}, L_{θ_v};
19:     apply the gradients to update θ_p* and θ_v*;
20:     empty Λ;
21:   end
22: end
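Lines 9-10 of Algorithm 1 select an action by roulette-wheel sampling over the policy distribution, i.e., the first action whose cumulative probability ζ(a) reaches a uniform random number. A minimal NumPy sketch of that step:

```python
import numpy as np

def roulette_action(pi: np.ndarray, rng=np.random.default_rng()) -> int:
    """Lines 9-10 of Algorithm 1: cumulative sum zeta of the policy, then the first
    action whose cumulative probability reaches a uniform random number."""
    zeta = np.cumsum(pi)                         # zeta(a) for a = 0, 1, ..., K*J - 1
    idx = int(np.searchsorted(zeta, rng.random()))
    return min(idx, len(pi) - 1)                 # guard against floating-point round-off

# Example with a toy 5-action policy (K = 5, J = 1 as in Section V).
print(roulette_action(np.array([0.1, 0.4, 0.2, 0.2, 0.1])))
```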
In line 1 of Algorithm 1, the actor-learner initiates an empty experience buffer Λ. Then, for each R_t, the algorithm checks whether Λ is empty (i.e., a new episode starts) and, if so, synchronizes the local DNNs with the global parameters (lines 3-5). Line 6 updates the EON state by releasing the resources allocated to lightpaths that have expired. In line 7, we obtain s_t based on the model discussed in Section IV-A. In line 8, we invoke the policy and value DNNs to generate an RMSA policy and a value estimation for s_t. Note that, in DeepRMSA-EP, we make s_t include one more element to indicate the position of R_t within the current episode. For instance, if R_t is the i-th request of the episode, we calculate a position indicator as (N − i + 1)/N. The algorithm decides an RMSA scheme based on the generated policy (lines 9-10, i.e., with the roulette strategy) and receives a reward accordingly (line 11). The RMSA sample is then stored in the buffer (line 12). With lines 13-21, DeepRMSA-EP performs training every time the buffer contains N samples. Specifically, in the for-loop of lines 14-16, the algorithm first calculates for each sample χ_{t'} in the buffer the discounted cumulative reward (starting from R_{t'} till the end of the episode) as

\Gamma_{t'} = \sum_{i \in [0, N-1],\, \chi_{t'+i} \in \Lambda} \gamma^{i} \cdot r_{t'+i}.   (3)

Then, the advantage of each action taken can be obtained as

\delta_{t'} = \Gamma_{t'} - f_{\theta_v}(s_{t'}),   (4)

which indicates how much better an action turns out to be than estimated. Lines 17-18 calculate the policy and value losses L_{θ_p} and L_{θ_v}, from which the policy and value gradients can be derived. In particular, L_{θ_p} is defined as

L_{\theta_p} = -\frac{1}{N} \sum_{\chi_{t'} \in \Lambda} \delta_{t'} \log f_{\theta_p}(s_{t'}, a_{t'}) - \frac{\alpha}{N} \sum_{\chi_{t'} \in \Lambda} \sum_{a \in A} f_{\theta_p}(s_{t'}, a) \log f_{\theta_p}(s_{t'}, a),   (5)

where α (0 < α ≪ 1) is a weighting coefficient. The second term of Eq. 5 facilitates exploration (by introducing the total entropy of the policies into the loss), while L_{θ_v} is defined as the mean squared error of the value estimations, i.e.,

L_{\theta_v} = \frac{1}{N} \sum_{\chi_{t'} \in \Lambda} \left( f_{\theta_v}(s_{t'}) - \Gamma_{t'} \right)^2.   (6)

Finally, in lines 19-20, the actor-learner applies the gradients to tune the global DNN parameters with training algorithms such as RMSProp or Adam [21], and empties the buffer to get prepared for the next episode.
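The per-buffer quantities in Eqs. 3-6 can be computed directly from the N stored samples, as sketched below with NumPy; the gradient derivation and the actual RMSProp/Adam update of θ_p* and θ_v* are omitted, and the entropy term keeps the sign exactly as written in Eq. 5.

```python
import numpy as np

def episode_losses(rewards, values, policies, actions, gamma=0.95, alpha=0.01):
    """Compute Gamma_t', delta_t' (Eqs. 3-4) and the losses L_theta_p, L_theta_v (Eqs. 5-6)
    for one buffer of N samples. `policies` is an (N, |A|) array of f_theta_p outputs,
    `values` the N value estimates f_theta_v(s_t'), `actions` the chosen action indices."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    policies = np.asarray(policies, dtype=float)
    actions = np.asarray(actions, dtype=int)
    N = len(rewards)

    # Eq. 3: discounted return from each sample to the end of the episode buffer.
    Gamma, running = np.zeros(N), 0.0
    for i in range(N - 1, -1, -1):
        running = rewards[i] + gamma * running
        Gamma[i] = running

    delta = Gamma - values                                      # Eq. 4: advantages

    chosen = policies[np.arange(N), actions]                    # f_theta_p(s_t', a_t')
    eps = 1e-12                                                 # numerical stability only
    policy_loss = (-np.mean(delta * np.log(chosen + eps))
                   - alpha * np.mean(np.sum(policies * np.log(policies + eps), axis=1)))  # Eq. 5
    value_loss = np.mean((values - Gamma) ** 2)                 # Eq. 6
    return Gamma, delta, policy_loss, value_loss
```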
Note that the uncertainty of dynamic lightpath requests can result in unpredictable trajectories of s_t, which in turn can cause oscillations of the cumulative rewards and destabilize the training process. This problem becomes especially severe when the numbers of requests involved are small. Recalling the calculation of cumulative rewards in Eq. 3, the number of rewards summed in Γ_{t'} decreases as χ_{t'} gets closer to the end of the buffer, until Γ_{t'} eventually contains the reward from only one request. To cope with this issue, we propose a window-based flexible training mechanism for DeepRMSA, namely DeepRMSA-FLX. Basically, DeepRMSA-FLX invokes the training process each time the buffer contains 2N − 1 samples. DeepRMSA-FLX slides a window of length N through the buffer and calculates the cumulative reward for each of the first N samples, still with Eq. 3. Thus, every cumulative reward involves the rewards from servicing N requests. By doing so, we aim to smooth out the oscillations equally for all the samples (if N is sufficiently large¹). Then, the algorithm calculates the policy and value losses with these N samples and updates the global DNN parameters accordingly. The N samples are removed from the buffer afterward. Meanwhile, in DeepRMSA-FLX, the condition for synchronizing the local DNNs (line 3 of Algorithm 1) becomes |Λ| being equal to N − 1.

¹Note that we typically set N to moderate values, e.g., 50, to allow training signals to be applied to the DNNs quickly.
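The two mechanisms differ only in when training is triggered and over which samples Eq. 3 is evaluated. The sketch below contrasts the returns computed by DeepRMSA-EP (buffer of N samples, shrinking sums) with those of DeepRMSA-FLX (buffer of 2N − 1 samples, a sliding window of exactly N rewards per sample); it is a self-contained illustration, not published DeepRMSA code.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.95):
    """Eq. 3 evaluated over exactly the given slice of rewards."""
    out, running = np.zeros(len(rewards)), 0.0
    for i in range(len(rewards) - 1, -1, -1):
        running = rewards[i] + gamma * running
        out[i] = running
    return out

def ep_returns(buffer_rewards, N, gamma=0.95):
    """DeepRMSA-EP: train when |buffer| == N; the i-th return sums only the remaining
    N - i rewards of the episode, so later samples accumulate fewer and fewer rewards."""
    assert len(buffer_rewards) == N
    return discounted_returns(buffer_rewards, gamma)

def flx_returns(buffer_rewards, N, gamma=0.95):
    """DeepRMSA-FLX: train when |buffer| == 2N - 1; slide a length-N window so that
    each of the first N samples accumulates rewards from exactly N requests."""
    assert len(buffer_rewards) == 2 * N - 1
    r = np.asarray(buffer_rewards, dtype=float)
    return np.array([np.sum((gamma ** np.arange(N)) * r[i:i + N]) for i in range(N)])
```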
V. EVALUATION

A. Simulation Setup

We evaluated the performance of DeepRMSA with numerical simulations. We first used the 14-node NSFNET topology in Fig. 2 and assumed that each fiber link could accommodate 100 FS's. The dynamic lightpath requests were generated according to a Poisson process under a uniform traffic distribution, with the average arrival rate and service duration being 10 and 15 time units, respectively. The bandwidth requirement of each request was evenly distributed between 25 and 100 Gb/s. The DNNs used ELU as the activation function for the hidden layers. We set K = 5 and J = 1. Hence, DeepRMSA selected only the routing paths and applied the first-fit scheme for spectrum allocation. γ, α, N and the learning rate were set as 0.95, 0.01, 50 and 10^-5, respectively. We used the Adam algorithm [21] for training. Note that we normalized every field of s_t before feeding it to the DNNs.

Fig. 2. 14-node NSFNET topology (link length in kilometers).
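The request process described above can be generated as in the sketch below. The Poisson arrivals (rate 10), the uniform node-pair selection, and the uniform 25-100 Gb/s bandwidths follow the stated setup directly, while the exponential holding-time distribution (mean 15 time units) is an assumption the text does not pin down.

```python
import numpy as np

def generate_requests(num_requests, num_nodes=14, arrival_rate=10.0, mean_holding=15.0,
                      rng=np.random.default_rng(0)):
    """Yield (arrival_time, src, dst, bandwidth_gbps, holding_time) tuples: Poisson arrivals,
    uniformly chosen source/destination pairs, and bandwidths uniform in [25, 100] Gb/s."""
    t = 0.0
    for _ in range(num_requests):
        t += rng.exponential(1.0 / arrival_rate)       # Poisson arrivals, 10 per time unit
        src, dst = rng.choice(num_nodes, size=2, replace=False)
        bandwidth = rng.uniform(25.0, 100.0)
        holding = rng.exponential(mean_holding)        # assumed exponential holding time
        yield t, int(src), int(dst), float(bandwidth), holding
```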
B. Numerical Results

We first assessed the impact of the scale of the DNNs on the performance of DeepRMSA. We fixed the number of actor-learners as 16 and implemented DNNs of three setups for both DeepRMSA-EP and DeepRMSA-FLX, i.e., 3 hidden layers of

Fig. 3. Cumulative rewards from DeepRMSA-FLX and DeepRMSA-EP with different (a), (c): DNN sizes, and (b), (d): numbers of actor-learners.

Fig. 4. Request blocking probability versus training epoch (×500) with the NSFNET topology: DeepRMSA-FLX (16, 128), DeepRMSA-EP (16, 128), SP-FF, and KSP-FF.

Fig. 5. (a) Normalized value loss, and (b) entropy of policy during training.

Fig. 6. (a) 11-node COST239 topology (link length in kilometers), and (b) request blocking probability with the COST239 topology.
yielding more interesting yet practical multi-agent competitive/cooperative learning problems.

ACKNOWLEDGMENTS

This work was supported in part by DOE DE-SC0016700 and NSF ICE-T:RC 1836921.

REFERENCES

[1] O. Gerstel, M. Jinno, A. Lord, and S. J. B. Yoo, "Elastic optical networking: a new dawn for the optical layer?" IEEE Commun. Mag., vol. 50, pp. S12–S20, Apr. 2012.
[2] M. Jinno, B. Kozicki, H. Takara, A. Watanabe, Y. Sone, T. Tanaka, and
A. Hirano, “Distance-adaptive spectrum resource allocation in spectrum-
sliced elastic optical path network,” IEEE Commun. Mag., vol. 48, no. 8,
pp. 138–145, Aug. 2010.
[3] B. Chatterjee, N. Sarma, and E. Oki, “Routing and spectrum allocation
in elastic optical networks: A tutorial,” IEEE Commun. Surveys Tuts.,
vol. 17, no. 3, pp. 1776–1800, thirdquarter 2015.
[4] Y. Wang, X. Cao, and Y. Pan, “A study of the routing and spectrum
allocation in spectrum-sliced elastic optical path networks,” in Proc. of
INFOCOM, April 2011, pp. 1503–1511.
[5] K. Christodoulopoulos, I. Tomkos, and E. Varvarigos, “Elastic band-
width allocation in flexible OFDM-based optical networks,” J. Lightw.
Technol., vol. 29, pp. 1354–1366, May 2011.
[6] M. Klinkowski, M. Ruiz, L. Velasco, D. Careglio, V. Lopez, and
J. Comellas, “Elastic spectrum allocation for time-varying traffic in
flexgrid optical networks,” J. Sel. Areas Commun., vol. 31, no. 1, pp.
26–38, January 2013.
[7] L. Gong, X. Zhou, W. Lu, and Z. Zhu, “A two-population based
evolutionary approach for optimizing routing, modulation and spectrum
assignments (RMSA) in O-OFDM networks,” IEEE Commun. Lett.,
vol. 16, pp. 1520–1523, Sept. 2012.
[8] H. Wu, F. Zhou, Z. Zhu, and Y. Chen, “On the distance spectrum
assignment in elastic optical networks,” IEEE/ACM Trans. Netw., vol. 25,
no. 4, pp. 2391–2404, Aug 2017.
[9] Z. Zhu, W. Lu, L. Zhang, and N. Ansari, “Dynamic service provisioning
in elastic optical networks with hybrid single-/multi-path routing,” J.
Lightw. Technol., vol. 31, no. 1, pp. 15–22, Jan 2013.
[10] Y. Yin, H. Zhang, M. Zhang, M. Xia, Z. Zhu, S. Dahlfort, and S. J. B.
Yoo, “Spectral and spatial 2D fragmentation-aware routing and spectrum
assignment algorithms in elastic optical networks,” J. Opt. Commun.
Netw., vol. 5, no. 10, pp. A100–A106, Oct 2013.
[11] F. Cugini, F. Paolucci, G. Meloni, G. Berrettini, M. Secondini, F. Fresi,
N. Sambo, L. Poti, and P. Castoldi, “Push-pull defragmentation without
traffic disruption in flexible grid optical networks,” J. Lightw. Technol.,
vol. 31, pp. 125–133, Jan. 2013.
[12] M. Zhang, C. You, and Z. Zhu, “On the parallelization of spectrum de-
fragmentation reconfigurations in elastic optical networks,” IEEE/ACM
Trans. Netw., vol. 24, no. 5, pp. 2819–2833, October 2016.
[13] V. Mnih and K. Kavukcuoglu et al., “Human-level control through deep
reinforcement learning,” Nature, vol. 518, pp. 529–533, 2015.
[14] V. Mnih, A. Badia, M. Mirza, A. Graves, T. Harley, T. Lillicrap,
D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep
reinforcement learning,” in Proc. of Int. Conf. Mach. Learn., 2016, pp.
1928–1937.
[15] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, pp. 436–444, 2015.
[16] N. Luong, D. Hoang, S. Gong, D. Niyato, P. Wang, Y. Liang, and
D. Kim, “Applications of deep reinforcement learning in communi-
cations and networking: A survey,” arXiv preprint arXiv:1810.07862,
2018.
[17] Z. Xu, J. Tang, J. Meng, W. Zhang, Y. Wang, C. Liu, and D. Yang,
“Experience-driven networking: A deep reinforcement learning based
approach,” arXiv preprint arXiv:1801.05757, 2018.
[18] S. Salman, C. Streiffer, H. Chen, T. Benson, and A. Kadav, “DeepConf:
Automating data center network topologies management with machine
learning,” in Proc. of NetAI, 2018, pp. 8–14.
[19] B. Kozicki, H. Takara, Y. Sone, A. Watanabe, and M. Jinno, “Distance-
adaptive spectrum allocation in elastic optical path network (SLICE)
with bit per symbol adjustment,” in Proc. of OFC, 2010, pp. 1–3.
[20] X. Chen, M. Tornatore, S. Zhu, F. Ji, W. Zhou, C. Chen, D. Hu, L. Jiang, and Z. Zhu, "Flexible availability-aware differentiated protection in software-defined elastic optical networks," J. Lightw. Technol., vol. 33, pp. 3872–3882, Sept. 2015.
[21] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[22] S. J. B. Yoo, "Multi-domain cognitive optical software defined networks with market-driven brokers," in Proc. of ECOC, Sept. 2014, pp. 1–3.
[23] X. Chen, Z. Zhu, L. Sun, J. Yin, S. Zhu, A. Castro, and S. J. B. Yoo, "Incentive-driven bidding strategy for brokers to compete for service provisioning tasks in multi-domain SD-EONs," J. Lightw. Technol., vol. 34, no. 16, pp. 3867–3876, 2016.
[24] X. Chen, Z. Zhu, J. Guo, S. Kang, R. Proietti, A. Castro, and S. J. B. Yoo, "Leveraging mixed-strategy gaming to realize incentive-driven VNF service chain provisioning in broker-based elastic optical inter-datacenter networks," J. Opt. Commun. Netw., vol. 10, no. 2, pp. 1–9, 2018.