
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3269576

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2023.DOI

Deep Multi-Agent Reinforcement Learning with Minimal Cross-Agent Communication for SFC Partitioning

ANGELOS PENTELAS 1,2, DANNY DE VLEESCHAUWER 1, CHIA-YU CHANG 1, KOEN DE SCHEPPER 1, AND PANAGIOTIS PAPADIMITRIOU 2
1 Nokia Bell Labs, Antwerp, Belgium
2 Department of Applied Informatics, University of Macedonia, Greece

Corresponding author: Angelos Pentelas (e-mail: [email protected]).

This work is supported by the European Union's Horizon 2020 research and innovation program under grant agreement no. 101017109 ("DAEMON").

ABSTRACT Network Function Virtualization (NFV) decouples network functions from the underlying
specialized devices, enabling network processing with higher flexibility and resource efficiency. This
promotes the use of virtual network functions (VNFs), which can be grouped to form a service function
chain (SFC). A critical challenge in NFV is SFC partitioning (SFCP), which is mathematically expressed as
a graph-to-graph mapping problem. Given its NP-hardness, SFCP is commonly solved by approximation
methods. Yet, the relevant literature exhibits a gradual shift towards data-driven SFCP frameworks, such as
(deep) reinforcement learning (RL).
In this article, we initially identify crucial limitations of existing RL-based SFCP approaches. In particular,
we argue that most of them stem from the centralized implementation of RL schemes. Therefore, we
devise a cooperative deep multi-agent reinforcement learning (DMARL) scheme for decentralized SFCP,
which fosters the efficient communication of neighboring agents. Our simulation results (i) demonstrate
that DMARL outperforms a state-of-the-art centralized double deep Q-learning algorithm, (ii) unfold the
fundamental behaviors learned by the team of agents, (iii) highlight the importance of information exchange
between agents, and (iv) showcase the implications stemming from various network topologies on the
DMARL efficiency.

INDEX TERMS Network Function Virtualization, Self-learning Orchestration, Multi-Agent Reinforcement Learning.

I. INTRODUCTION
Increasing processing demands, typically coupled with highly stringent performance requirements, put modern networks under severe stress. To mitigate the risk of under-delivery, domain experts promptly re-conceptualized several architectural components of networks, and pushed innovations that render them more agile, cost-effective, and high-performing. One such technology is Network Function Virtualization (NFV), which comprises a paradigm shift from hardware-based to virtual network functions (VNFs). As such, common network functions can be implemented in the form of virtual machines or containers and be deployed on-demand within commercial off-the-shelf (COTS) servers. In addition, multiple VNFs are arranged into an ordered sequence to form a service function chain (SFC), which in effect enforces an end-to-end flow processing policy.

Nonetheless, instilling such a high degree of flexibility into networks via NFV comes with a certain cost, which in this case is embodied by management and orchestration (MANO) complexity. Indeed, instantiating, configuring, monitoring, and scaling SFCs comprise only a subset of the operations that need to be applied upon these new network constructs. To support such features, the whole NFV endeavour is backed by the NFV MANO framework, which consists of a hierarchical architecture that positions an NFV orchestration (NFVO) module on top of virtualized infrastructure managers (VIMs). Effectively, each VIM operates on a set of COTS servers (which are typically deployed in a datacenter) and has a detailed view of critical local parameters, e.g., resource consumption levels, (anti-)affinity constraints, volume bindings, etc.

Yet, useful information from multiple VIMs must be conveyed to the NFVO layer, since certain actions of the latter might require the coordination of more than a single VIM. This is the case for the SFC partitioning (SFCP) problem, where the diverse location constraints of VNFs raise the need for their distribution across multiple VIMs.

From an algorithmic point of view, SFCP is proven to be an NP-hard optimization problem [1], which practically implies scalability limitations. As is typical in problems of such computational complexity, many studies model SFCP as a mixed-integer linear program (MILP), and subsequently propose heuristics or approximation algorithms that can cope with the scalability issue at the expense of solution quality (e.g., [2], [3]). However, these algorithms are usually tailored to specific environments (e.g., certain network topologies), constraints, and objectives, thereby requiring drastic redesign in case some problem parameters or goals change. Additionally, they generally assume perfect knowledge of the state of the physical network, which is highly unrealistic due to high network dynamics.

To overcome these limitations, data-driven resource allocation techniques, primarily based on reinforcement learning (RL), have recently started to gain traction within the problem space of SFCP (e.g., [4]–[12]). This algorithmic shift is grounded in several arguments. First, rapid advancements in the RL field and, in particular, the application of deep neural networks within the RL framework [13], empower it to handle vast state spaces. Prior to this development, RL methods were unable to cope with large environments, since quantifying each state-action combination required testing it multiple times. Second, RL methods are inherently purposed to design their own optimization strategies (i.e., policies). They achieve this merely by interacting with the environment upon which they are applied and by receiving a problem-specific reward signal that expresses the decision quality. Given that these features are aligned with the zero-touch network automation endeavour set out by operators, it is only natural that RL algorithms have become state-of-the-art SFCP methods.

A. LIMITATIONS OF CENTRALIZED SFCP WITH RL
With regard to centralized and RL-driven SFCP, the relevant literature exhibits certain limitations, the most crucial of which are listed and analyzed below:

Association of intents. A critical limitation of centralized SFCP with RL is the association of the resource allocation intents of the NFVO with the respective intents of the underlying VIMs. That is, as is common in hierarchical resource management systems with global and local controllers (e.g., NFVOs and VIMs, VIMs and hypervisors, k8s master and worker nodes), the global controller queries the local controllers about their available resources, and uses this information to compute a resource allocation decision (which is eventually realized by the latter). Effectively, this limits the inherent capacity of the local controllers to express their own resource allocation intents. In fact, global and local intents are not necessarily conflicting; on the contrary, fostering the acknowledgement of local intents can improve the resource allocation efficiency of the overall system, since each local controller has a more precise view of its actual state.

Decision-making with incomplete information. The common assumption that a single learning agent, positioned at the NFVO layer (which is the standard practice in centralized SFCP with RL), has a precise view over the entire topology is somewhat unrealistic. In fact, the higher we move along the NFV MANO hierarchy, the more coarse-grained the information we have at our disposal (because of data aggregation), which implies higher uncertainty. Conversely, if a centralized RL approach admits partial observability, it follows that the decision-making process relies on incomplete information.

Dependence of action space on topology nodes. In most works that propose centralized RL schemes for the SFCP problem, the action space of the learning agent coincides with the set of available points-of-presence (PoPs). This is somewhat natural, since the centralized agent has to infer which is the best PoP to host a particular VNF. However, the dependence of the agent's architecture on the physical nodes of the substrate makes the RL scheme particularly hard to scale. That is, the larger the action set, the longer the training time. Our argument is further strengthened by studies which employ techniques for shrinking the action space, e.g., clustering a set of PoPs into a single group, or placing the entire SFC within a single PoP (e.g., [7], [11]).

B. CONTRIBUTIONS
Admittedly, targeting a single solution that addresses all of the aforementioned limitations would be extremely optimistic. However, one can easily observe that most shortcomings stem from the centralized implementation of SFCP. Indeed, a decentralized approach that assigns learning agents locally to each VIM, in conjunction with a module at the NFVO layer that coordinates the resulting local decisions, would alleviate many of these limitations. In particular, (i) the global and local intents would be naturally decoupled, (ii) each agent would act based on detailed local observations, and thus, from the perspective of the overall system, the true state would be fully observable, and (iii) it would be easier to decouple action spaces from the physical topology. Along these lines, this work focuses on the investigation, the development, and the evaluation of a decentralized SFCP scheme based on cooperative deep multi-agent reinforcement learning (DMARL). Our main contributions are the following:
• We devise a DMARL framework for decentralized SFCP. Our DMARL algorithm consists of independent double deep Q-learning (DDQL) agents, and fosters the exchange of concise, yet critical, information among them.


• We develop a DDQL algorithm for centralized SFCP, and compare its performance with DMARL.
• We quantify fundamental rules identified by the team of agents over the course of training.
• We examine the impact of (imperfect) cross-agent communication across various substrate network topologies.

The remainder of this article is structured as follows. In Section II, we give a solid description of SFCP and basic formulations for system models. In Section III, we elaborate on deep RL and present a single-agent DDQL algorithm for SFCP. In Section IV, we cast SFCP as a multi-agent learning problem, and devise a DMARL algorithm to address it. Section V includes our evaluation results, and in Section VI we discuss related work. A thorough discussion is given in Section VII, and conclusions are laid out in Section VIII.

II. PROBLEM FORMULATION
In this section, we commence with a high-level description of SFCP and, subsequently, we formalize important concepts via system models.

A. PROBLEM DESCRIPTION
As already explained, an SFC is an ordered sequence of VNFs which enforces a flow processing policy. For instance, Firewall → Packet inspection → Load Balancer is a typical SFC which can be useful for, e.g., securing and load balancing a web application. Effectively, flows have to traverse the entire set of VNFs in the specified order to reach their destination.

To delve deeper, each VNF of the SFC requires computing resources (e.g., CPU) for packet processing. Additionally, connections among consecutive VNFs, henceforth termed virtual links (vlinks), require network resources (e.g., bandwidth) in order to forward traffic from one VNF to another. Such computing and network resources are offered by a multi-datacenter system. Specifically, computing resources are offered by servers, while network resources are allocated from physical links. It is also quite common to assume that computing resources are offered by a PoP (e.g., a datacenter).

Whether PoPs belong to a single or multiple administrative domains has a crucial effect on SFCP. Concretely, in multi-domain settings (where PoPs are managed by multiple administrators), resource utilization parameters are confined within each domain. Thereby, SFCP is typically solved via resource bidding mechanisms, e.g., [14], [15], and [16]. On the other hand, there are no information disclosure concerns in single-domain scenarios. However, this leads to the common misunderstanding that every bit of information can be easily conveyed to a centralized controller for decision-making with full observability. While theoretically possible, this practice is simply ineffective, since it implies high communication overheads [17]. Effectively, it is more realistic to assume that PoPs expose descriptive statistics about their resources, such as the average residual CPU capacity of their servers. To this end, our work considers single-domain settings, and it accounts for the limited information that can be exchanged between PoPs and the centralized controller.

FIGURE 1: An SFCP problem instance. On top, an SFC-R GR. At the bottom, a multi-PoP topology GS. Resources of physical links are not exhaustively depicted for figure clarity.

Irrespective of the number of domain administrators, SFCP boils down to the problem of computing a mapping between the virtual elements of an SFC and their physical counterparts, while adhering to resource capacity constraints. Specifically, VNFs are assigned to PoPs in a one-to-one fashion, while virtual links are assigned to physical links in a one-to-many fashion. Naturally, the VIM of each PoP then needs to compute a placement of the sub-SFC elements (i.e., the subset of VNFs and vlinks) onto physical servers and intra-datacenter links, but this topic is not within the scope of our work. An illustration of an SFCP problem instance is given in Fig. 1. An SFC (consisting of three VNFs) at the top of the figure is to be partitioned over a six-PoP topology at the bottom of the same figure. Both VNFs and virtual links require resources (dark circular sectors), while PoPs and physical links are characterized by available and allocated resources (light and dark circular sectors, respectively). Further, each SFC contains a source and a destination, while each PoP is represented by its coordinates in a two-dimensional space.

Optimality with respect to SFCP can take numerous forms. For instance, it can be seen through the lens of load balancing [18], throughput maximization [5], latency minimization [3], fault tolerance (reliability) [11], cross-service communication [19], or resource efficiency [20]. Our work intends neither to add another problem dimension nor to further explore an existing one; it rather focuses on the investigation of an RL scheme that is more aligned with current system architectures and goals. Here, we opt for the latency minimization objective, since a key problem for SFCP is to fulfill stringent low-latency requirements.
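As an illustration only, a problem instance like the one in Fig. 1 can be represented with two graphs: a directed SFC request and an undirected PoP-level substrate. The following Python sketch uses networkx as one possible representation (the paper does not prescribe a data structure); node names and attribute keys such as cpu, bw, loc, and avail_cpu are hypothetical, and the numbers are arbitrary placeholders. The formal models follow in the next subsection.

import networkx as nx

# SFC request: a directed chain of three VNFs plus auxiliary src/dst nodes.
sfc = nx.DiGraph()
sfc.add_nodes_from([
    ("src", {"cpu": 0.0}), ("vnf1", {"cpu": 0.10}),
    ("vnf2", {"cpu": 0.15}), ("vnf3", {"cpu": 0.05}), ("dst", {"cpu": 0.0}),
])
sfc.add_edges_from([
    ("src", "vnf1", {"bw": 0.05}), ("vnf1", "vnf2", {"bw": 0.05}),
    ("vnf2", "vnf3", {"bw": 0.05}), ("vnf3", "dst", {"bw": 0.05}),
])

# Substrate: a six-PoP topology; each PoP exposes its coordinates and the
# average available CPU across its servers (percentages, as in the text).
pops = nx.Graph()
for p in range(6):
    pops.add_node(p, loc=(p % 3, p // 3), avail_cpu=0.2)
pops.add_edges_from([(0, 1), (1, 2), (0, 3), (1, 4), (2, 5), (3, 4), (4, 5)],
                    avail_bw=1.0, delay=1)

# SFCP then amounts to mapping each VNF to a PoP (one-to-one) and each
# virtual link to a physical path (one-to-many).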

B. SYSTEM MODELS
Service function chain request (SFC-R) model. We model an SFC-R with the 5-tuple <GR, src, dst, t, ∆t>. GR = (VR, ER) is a directed graph, where VR and ER denote the set of nodes and edges (i.e., VNFs and virtual links), respectively. Additionally, dR(i) expresses the CPU demand of VNF i ∈ VR, while dR(i, j) represents the bandwidth requirements of (i, j) ∈ ER (both expressed as percentages). src and dst indicate the source and the destination coordinates of the SFC, while t expresses its arrival time, and ∆t denotes its lifespan.

We note that both src and dst are modeled as auxiliary VNF nodes with zero resource requirements (i.e., src, dst ∈ VR, and dR(src) = dR(dst) = 0). Naturally, this introduces two additional edges to the SFC graph, one from the src towards the first VNF, and one from the last VNF towards the dst (see top of Fig. 3). Further, we assume that the src and dst of an SFC correspond to the coordinates of arbitrary PoPs, denoted by usrc and udst ∈ VS.

Substrate network model. We model a substrate network with an undirected graph GS = (VS, ES), where VS and ES denote the sets of nodes and edges. Since we consider GS to be a PoP-level topology (i.e., a network of datacenters), VS corresponds to PoPs, and ES to inter-PoP links. dS,t(u) expresses the available CPU of node u at time t (i.e., the average available CPU of the servers belonging to PoP u), and dS,t(u, v) denotes the available bandwidth of the edge (u, v) at time t (again, as percentages). Last, D(u, v) represents the propagation delay of a physical link (u, v).

III. SINGLE-AGENT REINFORCEMENT LEARNING
This section discusses single-agent RL optimization. We initially outline the common RL setting, and proceed with a short description of the state-of-the-art DDQL algorithm. Subsequently, we elaborate on our proposed DDQL approach for SFCP.

A. BACKGROUND
The typical RL setting considers a learning agent, an environment, a control task subject to optimization, and a discrete time horizon H = {1, 2, ...}, which can as well be infinite. At time t ∈ H, the agent observes the current state of the environment st ∈ S, with S indicating the entire set of possible states. Equipped with a set of actions X, the agent interacts with the environment by choosing xt ∈ X. Then, the agent receives feedback regarding the quality of its action via a reward signal rt, and the environment transitions to the subsequent state st+1 ∈ S.

Conventionally, an RL task is modelled as a Markov decision process, defined by the tuple <S, X, S1, T, R>. Here, S and X are as before. S1 ∈ P(S) denotes the starting state distribution, and T : S × X → P(S) is the state transition function, where P(·) denotes the set of probability distributions over a set. Finally, R : S × X × S → R expresses the reward function, which is generally denoted as Rt instead of R(st, xt, st+1) for simplicity. Naturally, the aim of the agent is to maximize the accumulated discounted reward over H, i.e., Σ_{t∈H} γ^t Rt, where γ ∈ [0, 1) is a discount factor employed to prioritize immediate reward signals over those projected farther into the future. To achieve that, the agent needs to perform a sequence of actions based on a learned policy π, which is practically a function that maps states to probability distributions over actions, i.e., π : S → P(X), and the optimal policy is denoted as π*.

An important concept here is the action-value function, which quantifies the agent's incentive to perform a specific action in a particular state:

    Q(st, xt) := Et[ Σ_{k=0}^{∞} γ^k R_{t+k} | st, xt ],   (st, xt) ∈ S × X     (1)

Apparently, if the agent explores its action selection on the environment long enough such that the Q-values represent ground-truth reward values, then finding the optimal policy becomes trivial. That is, π*(st) = argmax_{x∈X} Q(st, x). However, Q-values are treated as estimates, since the assumption of perfect exploration is hardly ever realistic. To this end, the agent needs to balance exploration and exploitation, where the former strengthens its confidence in the Q estimations, and the latter enables it to benefit from accumulated knowledge. The most common approach to achieve a favourable exploration-exploitation trade-off is an ϵ-greedy action selection strategy. According to it, the agent performs a random action with probability ϵ, while with probability 1 − ϵ the action with the highest Q-value is selected. Given an initial ϵ, i.e., ϵ0, and a decay factor ϵdecay ∈ (0, 1), the agent can gradually shift from exploratory to exploitative:

    ϵt = ϵ0 · (ϵdecay)^t     (2)

B. DOUBLE DEEP Q-LEARNING
A well-known method that leverages the notions above is Q-learning, which is a model-free^1, off-policy^2 RL algorithm. Q-learning attempts to find optimal policies via a direct estimation of Q-values, which are maintained in a tabular format. Each time a state-action pair (st, xt) is visited, the respective Q-value is updated. An apparent restriction here is the lack of generalization capacity. That is, in problems with large state-action spaces, the risk of scarce Q-updates is high.

To overcome this limitation, Mnih et al. [13] propose a Q-function approximation scheme based on deep neural networks (NNs), commonly termed deep Q-learning (DQL). As per DQL, the learning capacity of the agent lies in an NN which approximates the Q-function by parameterizing it, i.e., Q(s, x) ∼ Q(s, x; θ) (where θ represents a vector of weights and biases of the NN). Ultimately, given a state as input, the NN computes a Q-value for every possible action. Then, it is up to an action selection strategy (e.g., ϵ-greedy) to choose a particular action.

Two mechanisms that contribute to the success of DQL are the replay memory and the coordination of an online NN and a target NN [21]. The replay memory is used to store state-action-reward-next state transitions; with proper sampling over it, we can subsequently train the online NN with uncorrelated data.

^1 It does not intend to discover the transition function.
^2 While following a specific policy, it assesses the quality of a different one.
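As a side note, the ϵ-greedy strategy with the decay schedule of Eq. (2) can be realized in a few lines. The following Python sketch (plain numpy; variable names are our own, not the authors') is only meant to make the mechanism concrete:

import numpy as np

rng = np.random.default_rng(0)
eps0, eps_decay = 1.0, 0.9998      # the values later used in Section III-C

def epsilon(t: int) -> float:
    """Exploration probability at training step t, as in Eq. (2)."""
    return eps0 * (eps_decay ** t)

def eps_greedy(q_values: np.ndarray, t: int) -> int:
    """Pick a random action with probability eps_t, otherwise the greedy one."""
    if rng.random() < epsilon(t):
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Example: Q-estimates over 5 actions (one per PoP in the centralized case).
print(eps_greedy(np.array([0.1, 0.7, 0.3, 0.2, 0.0]), t=10_000))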


FIGURE 2: Illustration of the typical DDQL architecture.

The interplay of the two NNs works as follows: initially, the target NN is an exact replica of the online NN, meaning that they share the same architecture, weights, and biases. However, it is only the online NN that is updated at every training step, while its weights and biases are copied to the target NN every δ training steps. The target NN (represented by θ−) is used to temporarily stabilize the target value which the online NN (represented by θ) tries to predict. Effectively, the training of the learning agent boils down to the minimization of Eq. (3), also known as the loss function. In the target term of Eq. (3), the greedy policy is evaluated by the online NN, whereas the greedy policy's value is evaluated by the target NN. As per [21], this reduces the overestimation of Q-values, which would be the case if both the greedy policy and its value were evaluated by the online NN. An illustration of the DDQL framework is given in Fig. 2.

    L(θt) = E[ ( rt + γ Q(st+1, argmax_{x∈X} Q(st+1, x; θt); θt−) − Q(st, xt; θt) )^2 ]     (3)

where the first term inside the outer parentheses constitutes the target and Q(st, xt; θt) the prediction.

C. DDQL FOR SFCP
We devise a single-agent RL algorithm for SFCP to serve as a baseline for comparison with our multi-agent RL scheme. Specifically, we opt for the DDQL method, as it has demonstrated promising results within our problem space (e.g., [7], [11], [22]). In our implementation, a training episode consists of the placement of an entire SFC, whereas a decision step refers to the placement of a single VNF of the SFC.

Observations. At time t, the agent observes information ot about the current VNF i and the SFC GR that it belongs to, as well as information about the available resources of the substrate physical topology GS. In particular, the agent receives the length of the SFC (|VR|), the coordinates of the src (src.loc) and dst (dst.loc) nodes of the SFC, the CPU demand of VNF i (dR(i)), the ingress (dR(i−, i)) and egress (dR(i, i+)) bandwidth requirements of VNF i, as well as its order (i.order) in the SFC. Regarding the physical network, the agent observes the coordinates (u.loc) and the average available CPU (dS,t(u)) of each PoP u ∈ VS, and the available bandwidth (dS,t(u, v)) of each link (u, v) ∈ ES. That is:

    ot = ( |VR|, src.loc, dst.loc,                                     [SFC state]
           dR(i), dR(i−, i), dR(i, i+), i.order,                       [VNF state]
           (u.loc, dS,t(u), ∀u ∈ VS), (dS,t(u, v), ∀(u, v) ∈ ES) )     [substrate state]

Notice that, since ot does not hold information about the individual servers of the topology (which are the elements that actually host VNFs), the environment is partially observable; hence we use observation ot instead of state st.

Actions. In our centralized implementation, an action determines which PoP will host the current VNF i. Concretely, the agent's action set is X = {1, 2, ..., |VS|}. Notice that, if decision steps referred to the placement of an entire SFC instead of individual VNFs, then the action space would have been substantially larger (i.e., |X| = |VS|^|VR|). This would have an adverse effect on the algorithm's performance.

Reward. A VNF placement is deemed successful if the selected PoP has at least one server with adequate resources to host it. An SFC placement (which is the ultimate goal) is deemed successful if all of its VNFs have been successfully placed and all of its virtual links are assigned onto physical paths that connect the VNFs correctly. For every successful VNF placement, the agent receives rt = 0.1. If the current VNF is the last VNF of the chain (i.e., the terminal VNF) and is successfully placed, then the virtual link placement commences (using Dijkstra's shortest path algorithm for every adjacent VNF pair, similar to [6] and [10]). If this process is successfully completed, the reward is computed as follows:

    rt = 10 · (|opt_path| + 1) / (|act_path| + 1)     (4)

where |opt_path| is the length of the shortest path between src.loc and dst.loc (recall that these are always associated with PoP locations), and |act_path| is the length of the actual path established by the DDQL algorithm. Clearly, the best reward rt = 10 is given when |opt_path| = |act_path|. In case the placement of any VNF or virtual link fails (due to inadequate physical resources), then rt = −10 and a new training episode initiates.
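The reward rules above can be summarized in a short Python function. This is a sketch of our own reading of the text (not the authors' code); the argument names are hypothetical:

def step_reward(n_success: bool, is_terminal: bool, l_success: bool,
                opt_path_len: int, act_path_len: int) -> float:
    """Reward signal mirroring the rules described above.

    n_success:   the current VNF fit into a server of the selected PoP
    is_terminal: the current VNF is the last one of the SFC
    l_success:   all virtual links were mapped onto physical paths
    """
    if not n_success:
        return -10.0                      # VNF placement failed
    if not is_terminal:
        return 0.1                        # intermediate VNF placed
    if not l_success:
        return -10.0                      # link embedding failed
    # Full SFC placed: Eq. (4); equals 10 when the actual path is optimal.
    return 10.0 * (opt_path_len + 1) / (act_path_len + 1)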

Architecture. As hinted by the previous discussions, the functionality of our DDQL agent relies on four key elements, namely an online NN, a target NN, a replay memory, and an action selection strategy. Both NNs comprise an input layer whose size equals the length of the observation vector ot (i.e., |ot| = 3|VS| + |ES| + 9), two fully-connected hidden layers with 256 neurons each, and an output layer with |VS| neurons. All layers are feed-forward, the activation function applied to individual cells is the Rectified Linear Unit (ReLU), and the loss function used is Eq. (3). The replay memory is implemented as a queue, which stores the latest 10,000 environment transitions. The online NN samples 64 random transition instances out of the replay memory at every training step, while the target NN is updated every δ = 20 steps. Last, we employ an ϵ-greedy action selection strategy, where, in Eq. (2), we set ϵ0 = 1 and ϵdecay = 0.9998.

Training. The instrumentation of the above is summarized in Algs. 1 and 2. In more detail, Alg. 1 describes the steps of the DDQL training process, which takes place over a collection of SFCs (sfc_dataset) and a physical PoP topology (topology) (line 1). At each episode (line 4), the agent handles the placement of a unique SFC. Prior to any action, the environment resets to a new state in line 6 (i.e., we increment the SFC index in the sfc_dataset by one, obtain the first VNF of the current SFC, and generate random loads for PoPs and physical links), and assigns the src and dst of the SFC to the respective PoPs (line 7). The core learning process lies within lines 9-16. First, the agent chooses an action based on the current state (line 10), the environment transitions to a new state based on the action taken (line 11), the transition is stored in the replay memory of the agent (line 13), the online NN is trained (line 14), and the current score and state are updated (lines 12 and 15). The transitions of the environment to a new state are further described in Alg. 2, which practically implements the reward computation and episode termination logic described earlier (see the Reward paragraph).

Algorithm 1 DDQL training procedure
1: Input: sfc_dataset, topology
2: env = Environment(sfc_dataset, topology)
3: agent = DDQLAgent(topology)
4: for sfc in sfc_dataset do                        ▷ training episode
5:     done = False
6:     state = env.reset()
7:     env.place_src_dst()
8:     score = 0
9:     while not done do                            ▷ training step
10:        action = agent.choose_action(state)
11:        new_state, reward, done = env.step(action)
12:        score += reward
13:        agent.store(state, action, reward, new_state)
14:        agent.learn()                            ▷ train the online NN
15:        state = new_state
16:    end while
17: end for

Algorithm 2 Environment transition procedure
1: Input: action
2: done = False
3: PoP = decode(action)
4: n_success = place_node(vnf, PoP)
5: if n_success and vnf.is_terminal then
6:     l_success, act_path, opt_path = place_links(sfc, topo)
7:     if l_success then                            ▷ successful SFCP
8:         reward = 10 · (|opt_path| + 1) / (|act_path| + 1)
9:     else                                         ▷ insufficient link capacities
10:        reward = −10
11:    end if
12:    new_state = next_sfc()
13:    done = True
14: else if n_success then                          ▷ successful VNF allocation
15:    reward = 0.1
16:    new_state = next_vnf()
17: else                                            ▷ insufficient PoP capacity
18:    reward = −10
19:    new_state = next_sfc()
20:    done = True
21: end if
22: return new_state, reward, done

IV. MULTI-AGENT REINFORCEMENT LEARNING
We herein introduce cooperative multi-agent reinforcement learning (MARL). Then, we discuss independent Q-learning, which is the foundation of the proposed solution. Finally, we describe in detail a DMARL scheme for SFCP.

A. COOPERATIVE MARL
MARL builds upon the fundamental blocks of single-agent RL, presented in Section III-A. In particular, MARL considers a set of n agents A = {1, ..., n} interacting with the environment. At each time step t ∈ H, each agent α ∈ A observes oα_t, and draws an action xα_t from its own designated action set Xα. The joint action xt = (x1_t, ..., xn_t) is then applied on the environment, which transitions to a new state st+1. In extension, each agent α observes oα_{t+1}. Similar to single-agent RL, the transition (st, xt, st+1) is evaluated by means of a reward function R, and a scalar rα_t is sent to each agent. In a fully cooperative MARL setting, all agents share the same reward, i.e., rα_t = rt, ∀ α ∈ A.

In principle, the dynamics of the above scheme can be captured by a cooperative Markov game, typically defined by the tuple <S, X, S1, T, R, Z, O, n>. Here, S, S1, T and R are as in Section III-A. X = X1 × ... × Xn denotes the joint action space, Z expresses the space of observations, and O : S × A → Z is the observation function that dictates the partial observability of each agent (i.e., O can be seen as the function that maps a state st and an agent α to an observation oα_t ∈ Z). In this setting, the goal of each agent α is to discover a policy πα : Z → P(Xα), such that the joint policy^3 π = (π1, ..., πn) : S → P(X) maximizes the (discounted) accumulated reward.

^3 Here, we implicitly assume that the union of the partial observations of all agents composes the state space, i.e., ∪_{α∈A} oα_t = st, ∀t.

B. INDEPENDENT Q-LEARNING
Undoubtedly, the simplest way of establishing a cooperative MARL framework is to treat each learning agent as an independent learning module. In this setting, if each agent is realized as a Q-learning agent, the respective MARL system is known as independent Q-learning (IQL) [23]. A major limitation of IQL is its lack of convergence guarantees, primarily stemming from the fact that the environment is non-stationary from the perspective of each agent. Effectively, this means that, for an agent α, both its reward rt and its next observation oα_{t+1} are not solely conditioned on its current observation oα_t and action xα_t. In other words, the transitions of the environment, and, by extension, the common rewards, are affected by the actions of other agents as well. Even though obtaining an optimized joint policy π while agents treat other agents as part of the environment is seemingly difficult, IQL exhibits good empirical performance [24].

Cooperative MARL is an active research field, given that numerous control tasks can be seen through the lens of multi-agent systems. In the context of collaborative DQL agents, recent works (e.g., [25] and [26]) have established frameworks that promote the joint training of multiple agents (i.e., centralized training - decentralized execution), under the assumption that the Qtotal-function of the multi-agent system can be decomposed into individual Q-functions such that, if each agent α maximizes its own Qα, then Qtotal is also maximized.

Irrespective of the promising advances in the field, our work builds upon typical IQL for the following reasons. First, IQL is simple enough to facilitate the interpretation of the system's learning behavior, as it avoids complex NN training schemes. Second, as will become apparent in Section IV-C, we envisage an action coordination module at the NFVO layer to handle the constraints of SFCP, which can significantly benefit the overall IQL framework.

C. DMARL FOR SFCP
We implement a DMARL scheme for SFCP. Specifically, we generate a single DDQL agent (as described in Section III-B) for every PoP u in the substrate network GS. For the rest of this section, we assume that agent α is associated with PoP u, and agent β with PoP v.

Observations. At time t, each agent α ∈ A observes information oα_t about the current VNF i and the SFC GR to which this VNF belongs, as well as the available resources of the substrate physical topology GS. In detail, agent α receives the length of the SFC, the coordinates of the src and dst nodes of the SFC, the CPU demand of VNF i, the ingress and egress bandwidth requirements of VNF i, and the order of this VNF in the SFC, similar to the single-agent observation mentioned in Section III-C. Regarding the physical network, agent α observes the coordinates of PoP u, the available CPU (dS,t(us)) of every server s ∈ u, and the available bandwidth of each physical link connected to PoP u.

We further augment the observation space of agents by enabling cross-agent communication. Specifically, we define N(α) to be the set of neighboring agents of α, i.e., N(α) = {β : distance(α, β) = 1, ∀β ∈ A − {α}}, where distance(α, β) refers to the length of the shortest path between PoPs u and v in GS (recall that α operates over u and β over v). Effectively, α receives the location coordinates and the average CPU capacity of every PoP v managed by a neighboring agent β ∈ N(α). As will become apparent, cross-agent communications are pivotal for the efficiency of the system, since they enable agents to reason about the state of other agents. Yet, to keep the communication overhead low, agents do not share their entire local state (i.e., the available CPU of each server, dS,t(us), ∀s ∈ u), but rather a single descriptive value (i.e., the average available CPU across all servers, dS,t(u)).

As explained in Section IV-B, agents within MARL settings operate over non-stationary environments, which practically implies that old experiences might become highly irrelevant as agents shift from exploratory to exploitative. To this end, we include a fingerprint [27] in the observation, namely ϵ (the probability of selecting a random action), as a means to distinguish old from recent experiences.

The last elements inserted into oα_t are three binary flags, namely hosts_another, hosts_previous, and in_shortest_path, which are computed by agent α prior to action selection. In particular, hosts_another = 1 if any VNF of the current SFC has been placed in PoP u, hosts_previous = 1 if the previous VNF of the current VNF has been placed in PoP u, and in_shortest_path = 1 if PoP u is part of the shortest path between the src and the dst of the SFC; they are zero otherwise. These three flags will prove crucial for analyzing the behavior of the DMARL system in Section V-C. As such, we define:

    oα_t = ( |VR|, src.loc, dst.loc,                                     [SFC state]
             dR(i), dR(i−, i), dR(i, i+), i.order,                       [VNF state]
             u.loc, (dS,t(us), ∀s ∈ u), (dS,t(u, v), ∀v ∈ N(u)), ϵ,      [local state (resources and fingerprint)]
             hosts_another, hosts_previous, in_shortest_path,            [local state (flags)]
             (v.loc, dS,t(v), ∀v ∈ N(u)) )                               [neighbors' condensed state]

where N(u) denotes the neighboring nodes of u in the graph GS. Note that, contrary to the single-agent case, here the team of agents observes the true state of the environment, i.e., ∪_{α∈A} oα_t = st. However, from the point of view of a single agent, the environment is still partially observable.

Actions. Agents are equipped with a set of actions that allows them to express their intents based on their local state. In detail, all agents share the same discrete set of actions X = {0, 1, 2}, where 0 indicates low willingness, 1 expresses a neutral position, and 2 indicates high willingness with respect to hosting the current VNF. Note that the above action sets are topology-agnostic (in contrast to the action set of our single-agent DDQL, where there is one action for each PoP).
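To make the composition of oα_t above concrete, the following Python sketch assembles such an observation vector for one agent. The field order mirrors the definition; the helper names and dictionary keys are illustrative assumptions, not the authors' implementation:

import numpy as np

def build_observation(sfc_state, vnf_state, pop, neighbors, epsilon):
    """Concatenate the pieces of o_t^alpha defined above into one flat vector.

    sfc_state: [len(SFC), src_x, src_y, dst_x, dst_y]
    vnf_state: [cpu demand, ingress bw, egress bw, order]
    pop:       dict with this agent's local view (keys assumed by this sketch)
    neighbors: list of dicts with each neighbor's coordinates and average CPU
    epsilon:   current exploration probability, used as the fingerprint
    """
    local = [*pop["loc"],                   # u.loc
             *pop["server_cpu"],            # d_{S,t}(u_s) for every server s
             *pop["link_bw"],               # d_{S,t}(u, v) for adjacent links
             epsilon,                       # fingerprint
             pop["hosts_another"],
             pop["hosts_previous"],
             pop["in_shortest_path"]]
    condensed = [x for n in neighbors for x in (*n["loc"], n["avg_cpu"])]
    return np.array([*sfc_state, *vnf_state, *local, *condensed], dtype=np.float32)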

FIGURE 3: The proposed DMARL scheme, decomposed into steps 0 to 6 and mapped to the NFV MANO framework.

Action coordination. The joint action xt = (x1_t, ..., xn_t) of all agents is forwarded to an action coordination module positioned at the NFVO layer. This module treats individual actions as soft decisions and eventually computes the firm decision according to the overall objective (e.g., load balancing, consolidation) and constraints (e.g., each VNF at exactly one PoP). Here, we opt for a simple action coordination scheme where (i) the agent with the highest action (ties broken arbitrarily) will enable its associated PoP to be selected to host the current VNF, and (ii) the objective is solely dictated by the reward function, meaning that the coordinator does not intend to achieve another objective. For example, if the joint action is xt = (xα_t = 2, xβ_t = 0) for the placement of a VNF i, the action coordination module will infer that i will be positioned at node u, which is the PoP of agent α, since xα_t > xβ_t.

Reward. Our DMARL algorithm adopts a reward scheme similar to the one presented in the single-agent case. That is, for every successful VNF placement, all agents receive r = 0.1, and for every successful SFC placement the common reward is computed as in Eq. (4). In case of a failed SFCP attempt (inadequate physical resources), agents receive r = −10.

Architecture. The proposed DMARL scheme is shown in Fig. 3, which illustrates the mapping of the proposed elements and functionalities onto the NFV MANO framework. At the top, we depict an SFC request, which is conveyed to the NFVO via its northbound interface (NBI) in step 0. During step 1, the NFVO forwards the VNF and SFC states to the underlying VIMs, and we assume that each VIM is associated with a single PoP. Additionally, each VIM is equipped with a DDQL agent similar to the one described in Section III-C, their only difference being the input and output layers. In particular, the input layer of each DDQL agent α in the DMARL setting has |oα_t| neurons, while its output layer consists of three neurons (one for each action). The API calls depicted in step 2 implement the cross-agent communication functionality, where each agent retrieves condensed information from its neighbors. Step 3 refers to the computation of the local state, i.e., monitoring the available CPU of each server and updating the three binary flag values. Afterwards, agents convey their soft actions via the NBIs of their VIMs towards the southbound interface (SBI) of the NFVO (step 4), where individual actions are aggregated into a joint action that is handed over to the coordination module. The latter resolves potential conflicts and, in principle, assesses the individual preferences with respect to the overall objective (step 5). Finally, the firm decision, which indicates the PoP that will host the current VNF, is sent to the respective VIM (step 6). This process is repeated until all VNFs and virtual links are assigned, or until the SFCP fails (virtual links are assigned via Dijkstra's method, similar to the single-agent case; cf. Section III-C).

Algorithm 3 DMARL training procedure
1: Input: sfc_dataset, topology, n_actions
2: env = Environment(sfc_dataset, topology)
3: marl = MARL(topology, n_actions)
4: for sfc in sfc_dataset do                        ▷ training episode
5:     done = False
6:     state = env.reset()
7:     env.place_src_dst()
8:     score = 0
9:     while not done do                            ▷ training step
10:        soft_actions = marl.choose_actions(state)
11:        firm_action = env.coordinate(soft_actions)
12:        new_state, reward, done = env.step(firm_action)
13:        score += reward
14:        marl.store(state, soft_actions, reward, new_state)
15:        marl.train()                             ▷ train the online NNs of all agents
16:        state = new_state
17:    end while
18: end for
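The coordinate() call in Alg. 3 (line 11) can be realized with a few lines. The following Python sketch implements the simple scheme described in the Action coordination paragraph (highest soft action wins, ties broken uniformly at random); it is our own illustration, not the authors' code:

import random

def coordinate(soft_actions):
    """Map per-agent soft actions in {0, 1, 2} to the PoP that hosts the VNF.

    soft_actions: list indexed by agent/PoP id, e.g. [1, 2, 0, 2, 1].
    Returns the index of the winning PoP (the firm decision).
    """
    best = max(soft_actions)
    candidates = [pop for pop, a in enumerate(soft_actions) if a == best]
    return random.choice(candidates)   # ties broken arbitrarily

# Example from the text: x_t = (x_alpha = 2, x_beta = 0) -> the PoP of agent alpha.
assert coordinate([2, 0]) == 0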

Training. The training procedure followed by the proposed DMARL scheme is outlined in Alg. 3, which exhibits many similarities with Alg. 1. Their core differences are as follows. First, in line 1, the number of actions (n_actions) is provided as input. That is, the number of actions is no longer determined by the number of PoPs; it rather becomes a parameter of the multi-agent framework. Specifically, the number of actions affects (i) the capacity of agents to express their resource allocation intents and (ii) the duration of the system's convergence. As a good compromise, we always set the number of actions to three in this work, i.e., actions are taken from the set {0, 1, 2}. Another difference between the two algorithms lies in the manner in which the state is interpreted in line 6. In Alg. 3, state refers to the VNF and the SFC state only, which is shared across all agents (see step 1 in Fig. 3). However, in line 10, further subroutines are called so that each agent chooses an action based on additional observations, such as its local state and neighboring (condensed) states (cf. steps 2 and 3 in Fig. 3). In lines 11 and 12, the coordination module derives the firm action, and the environment transitions to a new state based on this action, as explained in steps 4 to 6 in Fig. 3. In line 14, each agent stores its transition into its replay memory (each agent expands the state variable), and in line 15 all agents train their online NNs.

V. PERFORMANCE EVALUATION
This section covers a wide range of evaluation aspects, primarily focusing on the proposed DMARL algorithm. Initially, we present the simulation settings and the main performance metrics that we take into consideration. Then, we compare the single-agent DDQL method with the DMARL scheme, and elaborate on the learning behavior developed by the team of agents. Subsequently, we evaluate the cross-agent communication feature, and measure its implications across various physical network topologies. Finally, we experiment with an imperfect cross-agent communication scheme, which is more grounded in reality.

A. EVALUATION ENVIRONMENT AND PERFORMANCE METRICS
We consider linear SFCs consisting of two to five VNFs (excluding the src and dst auxiliary VNFs). Unless otherwise specified, the algorithms are trained with datasets containing 10,000 SFCs (i.e., 10,000 training episodes). Each VNF of an SFC requires 5-20% of a server's CPU. Each time a new SFC requests partitioning, the physical resources reset to arbitrary states. In particular, all previously embedded VNFs and virtual links are removed, and all servers generate random CPU loads within 70-100%. That is, their available CPU lies within 0-30%, and, in this way, we reduce the risk of precluding feasible partitionings due to CPU insufficiency. This approach (i) renders consecutive SFC placements (i.e., episodes) independent events, and (ii) enables us to reason about the actual learning efficiency of the algorithms, as low scores will be solely due to insufficient learning. Further, each PoP comprises ten servers, i.e., each PoP is a (micro) datacenter. Regarding the bandwidth of inter-PoP physical links, we assume that it always suffices for virtual link allocation in our current study, and aim to explore extensive insights into the aspect of virtual node allocations.

FIGURE 4: Topologies considered for evaluation. (a) Mesh; (b) Star; (c) Linear.

As illustrated in Fig. 4, we utilize three PoP topologies (i.e., GS) in our evaluations, which have been selected to unveil critical properties of the multi-agent framework. In particular, each topology comprises five PoPs; hence, the (single-agent) DDQL algorithm works with five actions (one per PoP), and the DMARL framework generates five independent DDQL agents (one per PoP). We note that the auxiliary src and dst VNFs of an SFC are associated with random PoPs (i.e., src.loc and dst.loc might even coincide; in this case, the length of the optimal path |opt_path| is zero). While we employ several micro-benchmarks in order to interpret and assess the algorithmic behaviors, two informative metrics refer to the tracking of (i) the accumulated reward, and (ii) the rate of optimal and rejected partitionings.
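For reference, a workload matching the above description could be generated along the following lines. This Python sketch reflects our own reading of the setup (the authors' generator is not included in this excerpt), and the helper names are hypothetical:

import random

random.seed(42)

def make_sfc():
    """A linear SFC of 2-5 VNFs, each demanding 5-20% of a server's CPU."""
    return [round(random.uniform(0.05, 0.20), 3)
            for _ in range(random.randint(2, 5))]

def reset_servers(n_pops=5, servers_per_pop=10):
    """Per-episode reset: every server keeps only 0-30% CPU available."""
    return [[round(1.0 - random.uniform(0.70, 1.00), 3)
             for _ in range(servers_per_pop)]
            for _ in range(n_pops)]

sfc_dataset = [make_sfc() for _ in range(10_000)]   # 10,000 training episodes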

B. COMPARISON WITH STATE-OF-THE-ART
We compare the performance of the (centralized) DDQL against the (distributed) DMARL algorithm over the mesh topology shown in Fig. 4a. Recall that DDQL serves as our state-of-the-art benchmark method. Instead of using existing DDQL schemes for SFCP (e.g., [7], [11], [22]), we devise a DDQL implementation (cf. Section III-C) which better matches our problem formulation and objective, thus allowing for a fair comparison.

FIGURE 5: Performance comparison between DDQL and DMARL. (a) Learning progress; (b) Optimal and rejected SFCs; (c) SFC partitioning footprint.

At first glance at Fig. 5a, where we plot the corresponding scores as moving averages over the last 100 episodes, we observe that both methods perform similarly. In particular, towards the end of the experiment, the two algorithms manage to retrieve an average score of 7.0. Taking Eq. (4) into account, we infer that both learning schemes obtain SFCPs that are, on average, 30% from the optimal. However, by zooming in on the 4,000-6,000 episode interval, it becomes apparent that the DMARL framework exhibits faster convergence towards the aforementioned score than DDQL. In fact, this is further corroborated by Fig. 5b, where it becomes clear that, in the exact same interval, the rate of optimal partitionings of DMARL grows more rapidly than that of DDQL. According to Fig. 5b, the rejection rates are negligible for both algorithms.

Although informative, Figs. 5a and 5b do not convey much about the differences of the underlying algorithmic behaviors. To this end, we also consider Fig. 5c, which depicts the footprint of optimal SFCPs. The most evident difference here is centered on the incapacity of DDQL to achieve many 1-PoP partitionings, which is counterbalanced by its higher capacity to achieve >2-PoP partitionings, compared to DMARL. Our interpretation of this behavior is as follows: DDQL has a partial view of the real state, but its observation contains information about the entire topology (see Section III-C). This enhances its inherent capacity to distribute VNFs across many PoPs, since it can reason about the state of the entire multi-datacenter system. At the same time, its observations only include the average available CPU capacity of each PoP (i.e., dS,t(u) in ot), which limits its confidence in consolidating multiple VNFs into the same PoP. For instance, if dS,t(u) = 0.15 for a PoP u ∈ VS, and the current VNF i demands dR(i) = 0.18, then it is impossible for DDQL to be certain whether i fits into a server of PoP u. Conversely, a DMARL agent α operating over u has a detailed view of the available CPU resources of all servers in PoP u, e.g., α observes (dS,t(u1) = 0.10, ..., dS,t(us) = 0.20, ..., dS,t(u10) = 0.15), and, in this case, it can be certain that server s of PoP u can accommodate VNF i.

Although the aforementioned limitation can be mitigated by extending DDQL's observation space with additional metrics (at the risk of further slowing its convergence), we deem that this analysis indicates the implications of decision-making with incomplete information in the context of SFCP.

C. DMARL ANALYSIS
To delve further into the behavior developed by the DMARL scheme, we employ Fig. 6. Here, we attempt to shed light on the learning aspect of the team of agents in the mesh topology, by examining certain action selection rules that have been identified over the course of training. To this end, we monitor the evolution of p(x|f1, f2, f3) at every training step, which shall be interpreted as the probability of taking action x ∈ X = {0, 1, 2}, given that f1 = hosts_another, f2 = hosts_previous, and f3 = in_shortest_path (recall that these are binary flags, defined in Section IV-C, and are part of the observation oα_t).

FIGURE 6: Action selection rules identified by DMARL agents. (a) Agent 0; (b) Agent 1; (c) Agent 2; (d) Agent 3; (e) Agent 4.

As per Figs. 6a-6e, every agent manages to discover four key rules which, from a human perspective, seem rather intuitive for the SFCP problem. Specifically, the increase of p(0|0, 0, 0) implies that the agents tend to express low willingness to host the current VNF when all flags are down, i.e., no other VNF is placed in the associated PoP (hosts_another = 0), the previous VNF is placed in another PoP (hosts_previous = 0), and the associated PoP is not in the shortest path from the src to the dst (in_shortest_path = 0). In contrast, a growing p(2|1, 1, 1) means that agents tend to express high willingness to host the current VNF when all flags are up. At the same time, the counter-intuitive p(2|0, 0, 0) (high willingness when all flags are down) and p(0|1, 1, 1) (low willingness when all flags are up) seem to decrease over time. Our argument that these rules are indeed learned is backed by the fact that the respective probabilities diverge from the horizontal dotted line at p = 0.33, which indicates random action selection. Nevertheless, these lines never reach either 1.0 or 0.0, which may imply that the additional observation dimensions (besides the three flags) also play an important role in the action selection.
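The rule probabilities p(x | f1, f2, f3) plotted in Fig. 6 are, in essence, empirical frequencies conditioned on the flag combination. A minimal Python sketch of how such statistics could be tracked from logged decisions is shown below (our own illustration; the paper does not detail the bookkeeping):

from collections import Counter, defaultdict

counts = defaultdict(Counter)   # (f1, f2, f3) -> Counter over actions {0, 1, 2}

def record(flags, action):
    """Log one decision; flags = (hosts_another, hosts_previous, in_shortest_path)."""
    counts[tuple(flags)][action] += 1

def p(action, flags):
    """Empirical p(x | f1, f2, f3); 1/3 would correspond to random selection."""
    c = counts[tuple(flags)]
    total = sum(c.values())
    return c[action] / total if total else float("nan")

record((0, 0, 0), 0)
record((0, 0, 0), 0)
record((0, 0, 0), 2)
print(p(0, (0, 0, 0)))   # 0.666...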

FIGURE 7: Cross-agent communication in the mesh topology. (c) Agent 1. (d) Agent 2.
FIGURE 8: Cross-agent communication in the star topology. (a) Learning progress. (b) Optimal and rejected SFCs. (c) Behavior of Agent 0.
FIGURE 9: Cross-agent communication in the linear topology. (a) Learning progress. (b) Optimal and rejected SFCs. (c) Action selection (com. off). (d) Action selection (com. on).
D. CROSS-AGENT COMMUNICATION
We now evaluate the impact of cross-agent communication on the overall efficiency of the DMARL scheme across all three PoP topologies shown in Fig. 4. To this end, we implement a variant of the DMARL algorithm in which agents do not receive the neighbors' condensed state (i.e., the location and average available CPU entries in o_t^α).
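For clarity, the sketch below illustrates one way such a communication toggle can be realized when assembling an agent's observation; the feature layout, helper name, and array shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_observation(local_features, neighbor_states, communication_on=True):
    """Assemble an agent's observation vector.

    `local_features`: 1-D array with the agent's own PoP metrics and flags.
    `neighbor_states`: list of 1-D arrays, one per neighbor, each holding that
    neighbor's condensed state (e.g., location coordinates and average
    available CPU).  When communication is disabled, these entries are simply
    not appended, so the observation has fewer dimensions than in the
    communication-on variant.
    """
    parts = [np.asarray(local_features, dtype=np.float32)]
    if communication_on:
        parts.extend(np.asarray(s, dtype=np.float32) for s in neighbor_states)
    return np.concatenate(parts)

# Toy usage: two neighbors, each summarized by (x, y, avg_available_cpu).
local = [0.4, 0.2, 1.0, 0.0, 1.0]            # e.g., CPU statistics plus binary flags
neighbors = [[0.0, 1.0, 0.35], [1.0, 1.0, 0.10]]
print(build_observation(local, neighbors, communication_on=True).shape)   # (11,)
print(build_observation(local, neighbors, communication_on=False).shape)  # (5,)
```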
Fig. 7 summarizes our findings for the mesh topology (Fig. 4a). According to Figs. 7a and 7b, the DMARL scheme with the communication feature on dominates the respective scheme with communication off. In particular, the former manages to converge faster towards an average score of 7.0, and its rate of optimal partitionings grows slightly quicker compared to the latter. To interpret this performance difference, we lay out Figs. 7c and 7d for DMARL with communication off, meant to be compared with Figs. 6b and 6c, respectively, where DMARL employs cross-agent communication. We focus on Agent 1 and Agent 2, since they operate over two pivotal PoPs in the mesh topology (i.e., PoPs 1 and 2 are highly relevant for SFCs). From the comparison of the respective p(x|f1, f2, f3) lines, we observe that, when cross-agent communication is disabled, action selection rules are learned both more slowly and less firmly.

Results from the respective comparison in the star topology (Fig. 4b) are depicted in Fig. 8. In particular, Figs. 8a and 8b emphasize the superiority of the DMARL algorithm that uses cross-agent communication. Here, the former converges both faster and to a higher average reward, and the difference in optimal partitionings is close to 10%. It is worth noting that we observed slower and less firm identification of action selection rules in this case as well, but the respective results are omitted due to space limitations. Instead, we plot Fig. 8c, where we zoom into a behavioral aspect of Agent 0 (since PoP 0 is a pivotal PoP in the star topology). Specifically, p(0|0, 0, 1) is lower for Agent 0 when cross-agent communication is enabled, meaning that this agent is still more likely to select actions 1 or 2 when it is on the shortest path from the src to the dst, even if it does not host another VNF or the previous VNF. Note that the respective line for DMARL with communication off is closer to random. This corroborates the fact that enabled communication helps Agent 0 perceive its position in the star topology and, hence, potentially facilitate SFCPs as an intermediate PoP.

Finally, we assess the cross-agent communication feature in the linear PoP topology shown in Fig. 4c. Here, results regarding the learning progress (Fig. 9a) and the rate of optimal and rejected SFCPs (Fig. 9b) follow a similar trend to that of the star topology, and the benefits of cross-agent communication remain. Some insightful micro-benchmarks regarding the difference between the two schemes are illustrated in Figs. 9c and 9d. Specifically, these bar charts indicate action selection frequencies per agent during the last 2,000 episodes (where agents are essentially purely exploitative, as ϵ → 0+). Based on Fig. 9c, Agent 1 shows the most willingness to host VNFs; however, this contradicts our expectation for the linear topology, in which Agent 2 should participate in more SFCPs. In other words, Agent 2 cannot perceive its pivotal position in the linear topology when communication is off. In contrast, in Fig. 9d, notice that the action 2 frequencies are better aligned with the position of the agents in the linear topology. Specifically, we observe that Agent 2 is the most willing to allocate its PoP resources to VNFs, followed by its adjacent agents 1 and 3, while agents 0 and 4 participate in fewer SFCPs, given their outermost position in the linear topology. We also note that, in general, when communication is disabled, agents tend to select action 1 (i.e., a neutral position with respect to hosting a VNF) more frequently, which can be interpreted as a lack of confidence in their learning capacity.
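The action-frequency micro-benchmarks of Figs. 9c and 9d can be reproduced from logged decisions with a simple count, as in the following sketch; the log format is a hypothetical assumption made for illustration.

```python
from collections import Counter

def action_frequencies(decision_log, last_episodes=2000):
    """Count how often each agent selects actions 0, 1, and 2 during the
    final (exploitation) episodes.

    `decision_log` is assumed to be a list of (episode, agent_id, action)
    tuples collected during training; only entries from the last
    `last_episodes` episodes are considered.
    """
    final_episode = max(ep for ep, _, _ in decision_log)
    cutoff = final_episode - last_episodes + 1
    counts = {}
    for ep, agent, action in decision_log:
        if ep >= cutoff:
            counts.setdefault(agent, Counter())[action] += 1
    return counts

# Toy usage with a five-agent linear topology and a tiny synthetic log.
log = [(9999, 2, 2), (9999, 0, 1), (10000, 2, 2), (10000, 4, 1)]
print(action_frequencies(log, last_episodes=2000))
```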
According to the analysis above, cross-agent communication of short, yet informative, messages (i) enables faster and firmer identification of action selection rules, and (ii) enhances the agents' ability to identify their position within the topology and adjust their action selection behavior accordingly.
E. IMPERFECT CROSS-AGENT COMMUNICATION
The proposed DMARL scheme relies on the exchange of concise information among neighboring agents. Effectively, this is realized via cross-VIM/PoP communications (cf. step 2 in Fig. 3). However, communications are hardly ever perfect. For example, data communication may be delayed, packets may be lost, or erroneous data may appear owing to lossy (de)compression or channel fluctuations. Moreover, the communication medium may not be dedicated to a single service; therefore, the opportunities for information exchange between neighboring agents will be limited. Yet, if cross-PoP communications are totally infeasible, one can resort to the DMARL variant with the cross-agent communication feature off, whose behavior has been analyzed in Section V-D. Despite some performance degradation, it still manages to optimize SFCPs to a certain degree (cf. Figs. 7, 8, and 9).

To quantify the importance of cross-agent communications, we evaluate our DMARL framework over lossy inter-PoP links that exhibit non-zero packet loss rates l (i.e., l > 0); note that we here focus on the loss of neighbors' state information, rather than on the packet loss of VNF data traffic. That is, an agent α receives nothing from its neighbor agent β ∈ N(α) with probability l (in this case, the respective entries in o_t^α are filled with 0s), while it receives the actual condensed state of agent β with probability 1 − l. Here, we experiment with three levels of loss rate, i.e., l = 0.5%, l = 1%, and l = 2%, over the mesh topology (Fig. 4a).

FIGURE 10: Imperfect cross-agent communication schemes.

In Fig. 10, to better identify their performance differences, we plot the respective scores as moving averages across the entire training period. The observations that can be drawn are as follows. First, smaller loss rates come with better DMARL performance, while even an l = 2% loss rate leads to an average score that is half a unit lower than that of the perfect communication scheme (i.e., where l = 0%). Another interesting outcome concerns the comparison of the imperfect communication schemes with the communication off scheme. As per Fig. 10, disabling cross-agent communication (i.e., com: off) results in scores similar to those of the DMARL scheme which operates over the mesh topology with an l = 1% loss rate, and even outperforms the scenario with l = 2%. That is because, in the communication off scheme, agents learn optimized policies without ever expecting information from their neighbors. Conversely, in the communication on schemes, agent policies rely on cross-agent communications, and if the latter are faulty, this has an adverse effect on action selection. Further, the observation dimensions in communication off schemes are always fewer than the corresponding dimensions of communication on schemes. In this sense, communication off is not equivalent to communication on with l = 100%.

At first glance, the above results suggest that the proposed DMARL scheme is sensitive to the loss of cross-agent communications and would require ultra-reliable communication of neighbors' states. However, this is not fully accurate. In fact, we conducted a variant of the above experiment in which each agent knows the location of its neighbors perfectly (i.e., v.loc, ∀v ∈ N(u) for PoP u), while the loss rate is applied only to the neighbors' average available CPU (i.e., d_{S,t}(v), ∀v ∈ N(u) for PoP u); in this case, only marginal differences in score are observed compared to the case with ideal cross-agent communication. In conclusion, it is crucial for DMARL agents to be aware of the positions of their neighbors, as this position awareness allows them to perceive their own position within the topology more accurately and, by extension, to develop appropriate action selection behaviors. This aspect, i.e., the significance of the content of the exchanged information, opens up the opportunity to apply different quality of service (QoS) requirements to different parts of the neighbors' state information.
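To make the loss-injection mechanism described above concrete, the sketch below emulates the lossy delivery of a neighbor's condensed state, including the variant in which the location entries are always delivered; the state layout (two location entries followed by the average available CPU) and the helper name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def received_neighbor_state(neighbor_state, loss_rate, protect_location=False):
    """Simulate the lossy exchange of a neighbor's condensed state.

    With probability `loss_rate` the message is lost and the corresponding
    observation entries are filled with zeros; otherwise the actual condensed
    state is delivered.  With `protect_location=True`, the location part
    (assumed to be the first two entries here) is always delivered and the
    loss is applied only to the average-available-CPU entry, mirroring the
    variant discussed above.
    """
    state = np.asarray(neighbor_state, dtype=np.float32)
    if rng.random() >= loss_rate:
        return state                      # message delivered intact
    lost = np.zeros_like(state)           # message lost: zero-filled entries
    if protect_location:
        lost[:2] = state[:2]              # keep v.loc, drop only the CPU entry
    return lost

# Toy usage with l = 1%: on average about 1 in 100 exchanges is zero-filled.
sample = [received_neighbor_state([0.0, 1.0, 0.35], loss_rate=0.01) for _ in range(10000)]
print(sum(1 for s in sample if not s.any()) / len(sample))  # ~0.01
```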
VI. RELATED WORK
This section includes a comprehensive discussion of recent works that employ reinforcement learning techniques for SFCP. In particular, we classify them into i) single-domain SFCP, where all PoPs belong to the same provider, and ii) multi-domain SFCP, where PoPs belong to multiple providers.

A. RL IN SINGLE-DOMAIN SFCP
Most studies that tackle the SFCP problem in single-domain settings opt for centralized solutions. In our context, these are centralized RL agents handling decision-making over the entire set of PoPs across the domain.

In particular, Quang et al. [6] devise a deep deterministic policy gradient method based on the actor-critic RL paradigm. In order to enhance the exploration capacity of their method, the authors adopt a multiple-critic-network approach. These networks are intended to evaluate multiple noisy actions produced on the basis of the proto-action selected by the actor network. At each time step, the action to be selected is the one with the largest mean Q-value, computed across all critic networks. However, the actual updates of the actor network are driven by the critic network that exhibited the lowest loss. It is worth noting that the above RL scheme merely computes placement priorities, and not actual VNF allocations. The latter are handled by a heuristic, termed the heuristic fitting algorithm, which maps VNFs onto physical nodes in a greedy fashion based on rankings
and, subsequently, utilizes Dijkstra's method to compute link assignments.

Pei et al. [7] propose an SFCP framework which leverages DDQL. Here, the decision-making process is subdivided into three steps. First, given a network state, a preliminary evaluation of each action is performed, i.e., the computation of the respective Q-values. Out of all actions, all but the k best are discarded. In the second step, these k actions are executed in simulation mode, and the actual rewards and next states are observed. Finally, the action with the highest reward is performed on the physical infrastructure.
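For illustration, the following sketch captures the spirit of this three-step selection (shortlist by Q-value, evaluate in simulation, execute the best); the q_values array and the simulate callable are hypothetical stand-ins, not an API from [7].

```python
import numpy as np

def select_action(q_values, simulate, k=3):
    """Illustrative three-step selection in the spirit of the scheme in [7].

    `q_values`: array of preliminary Q-value estimates, one per action.
    `simulate`: callable returning the reward an action would obtain when
    executed in simulation mode (a hypothetical environment hook).  The k
    actions with the highest Q-values are rolled out in simulation, and the
    one with the highest observed reward is returned for execution on the
    physical infrastructure.
    """
    q_values = np.asarray(q_values, dtype=np.float32)
    shortlist = np.argsort(q_values)[-k:]                 # keep the k best actions
    simulated_rewards = {a: simulate(int(a)) for a in shortlist}
    return max(simulated_rewards, key=simulated_rewards.get)

# Toy usage: simulation favors action 2 even though action 5 has the top Q-value.
toy_q = [0.1, 0.4, 0.8, 0.2, 0.3, 0.9]
print(select_action(toy_q, simulate=lambda a: 1.0 if a == 2 else 0.0, k=3))  # -> 2
```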
Zheng et al. [8] investigate the deployment of SFCs in the context of the cellular core. To this end, the authors put both the intra-datacenter and the intra-server placement problems into the frame. Focusing our discussion on the intra-datacenter placement, the authors devise an approximation algorithm for optimized SFCP, under the assumption that the resource requirements and the lifespan of each SFC are fully disclosed. However, given the strictness of this hypothesis, the authors implement an SFCP scheme that can cope with demand uncertainty as well. In particular, the proposed method is conventional Q-learning. Here, a state s_t is expressed as the SFC type that needs to be deployed at time t, while an action x_t represents the selection of a server. Nonetheless, conventional Q-learning is not a viable option when the state-action space is vast.

Wang et al. [11] and Jia et al. [12] investigate SFCP through the lens of fault-tolerance, essentially considering the deployment of redundant VNFs or entire SFCs which shall be engaged in the event of a malfunction in the running SFC. In particular, Wang et al. [11] devise a DDQL algorithm that is trained to compute an ordered pair (u, v), where u is the index of the DC for the primary deployment, and v the backup DC where the standby SFC instance will be deployed. Effectively, this implies that entire SFCs will be placed in a single PoP. Jia et al. [12] decompose the SFC scheduling problem with reliability constraints into two sub-problems. First, they use a heuristic to determine the number of redundant VNF instances (per VNF). Then, they employ an A3C algorithm which, in conjunction with a rule-based node selection approach, facilitates the process of mapping these instances onto compute nodes. In practice, the DRL algorithm learns whether or not to defer the deployment of a VNF, instead of learning where to map it. However, given such a topology-independent action space, the proposed DRL scheme is not sensitive to evolving topologies.

In contrast to the above, our work promotes single-domain SFCP in a decentralized, yet coordinated, fashion, addressing various limitations of existing studies. First, the proposed DMARL scheme empowers local controllers (i.e., VIMs) with the ability to express their own resource utilization intents based on detailed local observations (as opposed to, e.g., [4]–[12]). This is explicitly achieved by designating one DDQL agent per VIM, and by utilizing action sets that express the degree of willingness with respect to hosting VNFs. Second, our DMARL scheme can indeed claim decision-making with complete information, since, from the perspective of the team of agents, the environment is fully observable. This is not the case with many studies, which either define incomplete representations of the true state or assume very detailed knowledge at the orchestration layer (e.g., [4], [6]–[8], [10], [11]). Third, the action spaces used by the individual DDQL agents of the DMARL framework are topology-agnostic (as opposed to, e.g., [4]–[11]), which alleviates scalability limitations and renders our framework a promising candidate for dynamic topologies.

B. RL IN MULTI-DOMAIN SFCP
Multi-domain SFCP is commonly solved via distributed methods [17]. This can be attributed to the limited information sharing among different providers, which hinders the possibility of solving SFCP in a centralized fashion.

In [28], the authors assign a DDPG-based RL agent to each domain. Further, the SFC request is encoded by the client (i.e., the entity that wants to deploy the SFC) and conveyed to each RL agent. The latter then treat the SFC encoding as their state and compute an action, which is effectively the bidding price for renting out their resources to the SFC request. Prices are accumulated by the client, who opts for the best combination based on a cost-based first-fit heuristic. The MARL setting here is inherently competitive, as agents do not have any incentive to maximize the rewards of other agents.

Toumi et al. [22] propose a hierarchical MARL scheme which employs a centralized agent (i.e., a multi-domain orchestrator, MDO) on top of multiple local-domain agents, where all agents are implemented with the DDQL architecture. The MDO receives the SFC request and decides which local domain will host each constituent VNF. Afterwards, the agents of the local domains are responsible for placing the sub-SFCs within their own nodes. Admittedly, the proposed MDO is quite similar to our own DDQL implementation for SFCP. If we assume that the partial observation of our DDQL due to data aggregation is equivalent to the partial observation of the MDO due to limited information sharing across multiple domains, then we can think of the comparison in Section V-B as a comparison of DMARL with the MDO (we acknowledge, however, that these two sources of partial observability are not equivalent, hence a direct comparison of DMARL with the MDO makes little sense).

In [29], Zhu et al. devise a MARL scheme for SFCP over IoT networks. Specifically, they propose a hybrid architecture that employs both centralized training and distributed execution strategies. Here, the SFCP problem is modeled as a multi-user competition game (i.e., a Markov game) to account for users' competitive behavior. In their implementation, the centralized controller performs global information statistical learning, while each user deploys service chains in a distributed manner, guided by a critic network.

While multi-domain SFCP differs from single-domain SFCP on the grounds of information disclosure, the related solutions exhibit some similarities to the proposed DMARL scheme. In effect, agents voting on hosting VNFs can be paralleled to agents bidding for resources, while our coordination module can be seen as a centralized broker which aggregates votes and computes the best action. However, the intrinsic competition that underlies multi-domain scenarios cannot be overlooked. For example, notice how agents in multi-domain SFCP strive to maximize their own rewards, disregarding the rewards of others. As such, our DMARL algorithm cannot be directly compared to any of the solutions above.

VII. DISCUSSION
Our work exhibits certain shortcomings, which we deem important to discuss and clarify. With respect to the underlying algorithmic framework we opted for, which is IQL, we have already acknowledged its lack of convergence guarantees. The natural next step is to examine a DMARL algorithm for SFCP based on centralized training and decentralized execution schemes, such as the ones proposed in [25] and [26], where finding the optimal joint policy is theoretically guaranteed (under certain assumptions). Additional limitations which we identify mostly pertain to the analysis of our method, rather than to the method itself. In particular, we evaluate DMARL over relatively small multi-PoP topologies. Yet, the intention of this work is to explore in depth the learning behavior that is developed by a team of independent RL agents in order to cooperatively solve SFCP. To this end, keeping the topologies small not only allows us to delve more easily into the behavior of individual agents, but also to demonstrate the respective findings (e.g., Fig. 6).

VIII. CONCLUSIONS
Irrespective of the above constraints, both IQL and our evaluation methodology enable us to obtain valuable knowledge. Specifically, the comparison in Section V-B indicates that DMARL outperforms a centralized state-of-the-art SFCP method based on DDQL, which we mainly attribute to the full observability of the former against the partial observability of the latter. Concretely, DDQL fails to consolidate VNFs to the extent that DMARL does, given the fact that it is not aware of the exact server capacities. We further corroborate that the team of agents manages to recognize certain intuitive action selection rules in Section V-C. This is particularly important, especially considering current efforts towards machine-learning explainability. Moreover, we identify that disabling the exchange of concise messages among neighboring agents limits their ability to perceive their position within the multi-PoP system, which has a negative effect on the developed action selection behaviors (Section V-D). Last, we quantify the performance drop of DMARL over multi-PoP systems which exhibit non-zero packet loss rates (Section V-E). The key takeaway here is that it is highly important for DMARL agents to at least be aware of the positions of their neighbors, which implies that location coordinates play an important role in the individual action selection policies.

While this work sheds light on numerous fundamental aspects of distributed resource allocation within NFV, it also paves the way for several new research directions. For instance, having established the perks of decentralized SFCP, a natural next step is to examine a multi-agent RL algorithm in which agents learn what, when, to whom, and how to communicate; in other words, agents shall learn their own cross-agent communication protocol.

REFERENCES
[1] M. Rost and S. Schmid, "On the hardness and inapproximability of virtual network embeddings," IEEE/ACM Transactions on Networking, vol. 28, no. 2, pp. 791–803, 2020.
[2] D. Dietrich, A. Abujoda, A. Rizk, and P. Papadimitriou, "Multi-provider service chain embedding with Nestor," IEEE Transactions on Network and Service Management, vol. 14, no. 1, pp. 91–105, 2017.
[3] A. Pentelas, G. Papathanail, I. Fotoglou, and P. Papadimitriou, "Network service embedding across multiple resource dimensions," IEEE Transactions on Network and Service Management, vol. 18, no. 1, pp. 209–223, 2020.
[4] S. Haeri and L. Trajković, "Virtual network embedding via Monte Carlo tree search," IEEE Transactions on Cybernetics, vol. 48, no. 2, pp. 510–521, 2017.
[5] Y. Xiao, Q. Zhang, F. Liu, J. Wang, M. Zhao, Z. Zhang, and J. Zhang, "NFVdeep: Adaptive online service function chain deployment with deep reinforcement learning," in Proceedings of the International Symposium on Quality of Service, 2019, pp. 1–10.
[6] P. T. A. Quang, Y. Hadjadj-Aoul, and A. Outtagarts, "A deep reinforcement learning approach for VNF forwarding graph embedding," IEEE Transactions on Network and Service Management, vol. 16, no. 4, pp. 1318–1331, 2019.
[7] J. Pei, P. Hong, M. Pan, J. Liu, and J. Zhou, "Optimal VNF placement via deep reinforcement learning in SDN/NFV-enabled networks," IEEE Journal on Selected Areas in Communications, vol. 38, no. 2, pp. 263–278, 2019.
[8] J. Zheng, C. Tian, H. Dai, Q. Ma, W. Zhang, G. Chen, and G. Zhang, "Optimizing NFV chain deployment in software-defined cellular core," IEEE Journal on Selected Areas in Communications, vol. 38, no. 2, pp. 248–262, 2019.
[9] R. Solozabal, J. Ceberio, A. Sanchoyerto, L. Zabala, B. Blanco, and F. Liberal, "Virtual network function placement optimization with deep reinforcement learning," IEEE Journal on Selected Areas in Communications, vol. 38, no. 2, pp. 292–303, 2019.
[10] Z. Yan, J. Ge, Y. Wu, L. Li, and T. Li, "Automatic virtual network embedding: A deep reinforcement learning approach with graph convolutional networks," IEEE Journal on Selected Areas in Communications, vol. 38, no. 6, pp. 1040–1057, 2020.
[11] L. Wang, W. Mao, J. Zhao, and Y. Xu, "DDQP: A double deep Q-learning approach to online fault-tolerant SFC placement," IEEE Transactions on Network and Service Management, vol. 18, no. 1, pp. 118–132, 2021.
[12] J. Jia, L. Yang, and J. Cao, "Reliability-aware dynamic service chain scheduling in 5G networks based on reinforcement learning," in IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. IEEE, 2021, pp. 1–10.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[14] D. Dietrich, A. Rizk, and P. Papadimitriou, "Multi-domain virtual network embedding with limited information disclosure," in 2013 IFIP Networking Conference. IEEE, 2013, pp. 1–9.
[15] P. T. A. Quang, A. Bradai, K. D. Singh, and Y. Hadjadj-Aoul, "Multi-domain non-cooperative VNF-FG embedding: A deep reinforcement learning approach," in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2019, pp. 886–891.
[16] N. Toumi, M. Bagaa, and A. Ksentini, "Hierarchical multi-agent deep reinforcement learning for SFC placement on multiple domains," in 2021 IEEE 46th Conference on Local Computer Networks (LCN). IEEE, 2021, pp. 299–304.

[17] J. C. Cisneros, S. Yangui, S. E. P. Hernández, and K. Drira, "A survey on distributed NFV multi-domain orchestration from an algorithmic functional perspective," IEEE Communications Magazine, vol. 60, no. 8, pp. 60–65, 2022.
[18] F. Carpio, S. Dhahri, and A. Jukan, "VNF placement with replication for load balancing in NFV networks," in 2017 IEEE International Conference on Communications (ICC). IEEE, 2017, pp. 1–6.
[19] A. Pentelas and P. Papadimitriou, "Network service embedding for cross-service communication," in 2021 IFIP/IEEE International Symposium on Integrated Network Management (IM). IEEE, 2021, pp. 424–430.
[20] ——, "Service function chain graph transformation for enhanced resource efficiency in NFV," in 2021 IFIP Networking Conference (IFIP Networking). IEEE, 2021, pp. 1–9.
[21] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
[22] N. Toumi, M. Bagaa, and A. Ksentini, "Hierarchical multi-agent deep reinforcement learning for SFC placement on multiple domains," in 2021 IEEE 46th Conference on Local Computer Networks (LCN). IEEE, 2021, pp. 299–304.
[23] M. Tan, "Multi-agent reinforcement learning: Independent vs. cooperative agents," in Proceedings of the Tenth International Conference on Machine Learning, 1993, pp. 330–337.
[24] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, "Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems," The Knowledge Engineering Review, vol. 27, no. 1, pp. 1–31, 2012.
[25] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls et al., "Value-decomposition networks for cooperative multi-agent learning," arXiv preprint arXiv:1706.05296, 2017.
[26] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson, "QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning," in International Conference on Machine Learning. PMLR, 2018, pp. 4295–4304.
[27] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson, "Stabilising experience replay for deep multi-agent reinforcement learning," in International Conference on Machine Learning. PMLR, 2017, pp. 1146–1155.
[28] P. T. A. Quang, A. Bradai, K. D. Singh, and Y. Hadjadj-Aoul, "Multi-domain non-cooperative VNF-FG embedding: A deep reinforcement learning approach," in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2019, pp. 886–891.
[29] Y. Zhu, H. Yao, T. Mai, W. He, N. Zhang, and M. Guizani, "Multiagent reinforcement-learning-aided service function chain deployment for Internet of Things," IEEE Internet of Things Journal, vol. 9, no. 17, pp. 15674–15684, 2022.

Angelos Pentelas (Student Member, IEEE) obtained the B.Sc. in Mathematics from the Aristotle University of Thessaloniki, Greece, and the M.Sc. in Applied Informatics from the University of Macedonia, Greece. Currently, he is pursuing the Ph.D. degree at the same department. From October 2021 to March 2022, he was with Nokia Bell Labs, Belgium, as a Ph.D. intern. His research is centered on decision-making methods for network orchestration.

Danny De Vleeschauwer (Member, IEEE) received the M.Sc. degree in electrical engineering and the Ph.D. degree in applied sciences from Ghent University, Belgium, in 1985 and 1993, respectively. He is currently a DMTS in the Network Automation Department of the Network Systems and Security Research Lab of Nokia Bell Labs in Antwerp, Belgium. Prior to joining Nokia, he was a Researcher at Ghent University. His early work was on image processing and on the application of queuing theory in packet-based networks. His current research interests include the distributed control of applications over packet-based networks.

Chia-Yu Chang (Member, IEEE) received the Ph.D. degree from Sorbonne Université, France, in 2018. He is currently a Network System Researcher at Nokia Bell Labs. He has more than 12 years of experience in algorithm/protocol research on communication systems. He has extensive research experience at both academic and industrial laboratories, including the EURECOM Research Institute, MediaTek, the Huawei Swedish Research Center, and Nokia Bell Labs. He has participated in several European Union collaborative research and innovation projects, such as COHERENT, SLICENET, 5G-PICTURE, 5Growth, and DAEMON. His research interests include wireless communication, computer networking, and edge computing, with particular interest in network slicing, traffic engineering, and AI/ML-supported network control.

Koen De Schepper received the M.Sc. degree in industrial sciences (electronics and software engineering) from IHAM Antwerpen, Belgium. He joined Nokia (then Alcatel) in 1990, where, during the first 18 years, he was a Platform Development Leader and Systems Architect. He has been with Bell Labs for the past 14 years, currently in the Network Systems and Security Research Laboratory. Before that, he worked mainly on transport layer protocols (L4) and their network support for scalable (SCAP) and low latency (L4S) content delivery. His current research interests include programmable data plane and traffic management, customizable network slicing, and AI-supported dynamic service and network control.

Panagiotis Papadimitriou (Senior Member, IEEE) is an Associate Professor at the Department of Applied Informatics of the University of Macedonia, Greece. Before that, he was an Assistant Professor at the Communications Technology Institute of Leibniz Universität Hannover, Germany, and a member of the L3S Research Center in Hanover. Panagiotis received a Ph.D. in Electrical and Computer Engineering from Democritus University of Thrace, Greece, in 2008, an M.Sc. in Information Technology from the University of Nottingham, UK, in 2001, and a B.Sc. in Computer Science from the University of Crete, Greece, in 2000. He has been a (co-)PI in several EU-funded (e.g., NEPHELE, T-NOVA, CONFINE, NECOS) and nationally-funded projects (e.g., G-Lab VirtuRAMA, MESON). Panagiotis was a recipient of Best Paper Awards at IFIP WWIC 2012 and IFIP WWIC 2016, and the runner-up Poster Award at ACM SIGCOMM 2009. He has co-chaired several international conferences and workshops, such as IFIP/IEEE CNSM 2022, IFIP/IEEE Networking TENSOR 2021–2020, IEEE NetSoft S4SI 2020, IEEE CNSM SR+SFC 2018–2019, IFIP WWIC 2017–2016, and INFOCOM SWFAN 2016. Panagiotis is also an Associate Editor of IEEE Transactions on Network and Service Management. His research activities include (next-generation) Internet architectures, network processing, programmable data planes, time-sensitive networking (TSN), and edge computing.
