Deep Multi-Agent Reinforcement Learning With Minim
ABSTRACT Network Function Virtualization (NFV) decouples network functions from the underlying
specialized devices, enabling network processing with higher flexibility and resource efficiency. This
promotes the use of virtual network functions (VNFs), which can be grouped to form a service function
chain (SFC). A critical challenge in NFV is SFC partitioning (SFCP), which is mathematically expressed as
a graph-to-graph mapping problem. Given its NP-hardness, SFCP is commonly solved by approximation
methods. Yet, the relevant literature exhibits a gradual shift towards data-driven SFCP frameworks, such as
(deep) reinforcement learning (RL).
In this article, we initially identify crucial limitations of existing RL-based SFCP approaches. In particular,
we argue that most of them stem from the centralized implementation of RL schemes. Therefore, we
devise a cooperative deep multi-agent reinforcement learning (DMARL) scheme for decentralized SFCP,
which fosters the efficient communication of neighboring agents. Our simulation results (i) demonstrate
that DMARL outperforms a state-of-the-art centralized double deep Q-learning algorithm, (ii) unfold the
fundamental behaviors learned by the team of agents, (iii) highlight the importance of information exchange
between agents, and (iv) showcase the impact of various network topologies on DMARL efficiency.
etc. Yet, useful information from multiple VIMs shall be conveyed to the NFVO layer, since certain actions of the latter might require the coordination of more than a single VIM. This is the case for the SFC partitioning (SFCP) problem, where the diverse location constraints of VNFs raise the need for their distribution across multiple VIMs.

From an algorithmic point of view, SFCP is proven to be an NP-hard optimization problem [1], which practically implies scalability limitations. As is typical for problems of such computational complexity, many studies model SFCP as a mixed-integer linear program (MILP), and subsequently propose heuristics or approximation algorithms that can cope with the scalability issue at the expense of solution quality (e.g., [2], [3]). However, these algorithms are usually tailored to specific environments (e.g., certain network topologies), constraints, and objectives, thereby requiring drastic redesign in case some problem parameters or goals change. Additionally, they generally assume perfect knowledge of the state of the physical network, which is highly unrealistic due to high network dynamics.

To overcome these limitations, data-driven resource allocation techniques, primarily based on reinforcement learning (RL), have recently started to gain traction within the problem space of SFCP (e.g., [4]–[12]). This algorithmic shift rests on several arguments. First, rapid advancements in the RL field and, in particular, the application of deep neural networks within the RL framework [13] empower RL to handle vast state spaces. Prior to this development, RL methods were unable to cope with large environments, since quantifying each state-action combination required testing it multiple times. Second, RL methods are inherently purposed to design their own optimization strategies (i.e., policies). They achieve this merely by interacting with the environment upon which they are applied and by receiving a problem-specific reward signal that expresses the decision quality. Given that these features are aligned with the zero-touch network automation endeavour set out by operators, it is only natural that RL algorithms have become state-of-the-art SFCP methods.

A. LIMITATIONS OF CENTRALIZED SFCP WITH RL
With regard to centralized and RL-driven SFCP, the relevant literature exhibits certain limitations, the most crucial of which are listed and analyzed below:

Association of intents. A critical limitation of centralized SFCP with RL is the association of the resource allocation intents of the NFVO with the respective intents of the underlying VIMs. That is, as is common in hierarchical resource management systems with global and local controllers (e.g., NFVOs and VIMs, VIMs and hypervisors, k8s master and worker nodes), the global controller queries the local controllers about their available resources, and uses this information to compute a resource allocation decision (which is eventually realized by the latter). Effectively, this limits the inherent capacity of the local controllers to express their own resource allocation intents. In fact, global and local intents are not necessarily conflicting; on the contrary, fostering the acknowledgement of local intents can improve the resource allocation efficiency of the overall system, since each local controller has a more precise view of its actual state.

Decision-making with incomplete information. The common assumption that a single learning agent, positioned at the NFVO layer (which is the standard practice in centralized SFCP with RL), has a precise view over the entire topology is somewhat unrealistic. In fact, the higher we move along the NFV MANO hierarchy, the more coarse-grained the information we have at our disposal (because of data aggregation), which implies higher uncertainty. Conversely, if a centralized RL approach admits partial observability, it follows that the decision-making process relies on incomplete information.

Dependence of action space on topology nodes. In most works that propose centralized RL schemes for the SFCP problem, the action space of the learning agent coincides with the set of available points-of-presence (PoPs). This is somewhat natural, since the centralized agent has to infer which is the best PoP to host a particular VNF. However, the dependence of the agent's architecture on the physical nodes of the substrate makes the RL scheme particularly hard to scale. That is, the larger the action set, the longer the training time. Our argument is further strengthened by studies which employ techniques for shrinking the action space, e.g., clustering a set of PoPs into a single group, or placing the entire SFC within a single PoP (e.g., [7], [11]).

B. CONTRIBUTIONS
Admittedly, targeting a single solution that addresses all of the aforementioned limitations would be extremely optimistic. However, one can easily observe that most shortcomings stem from the centralized implementation of SFCP. Indeed, a decentralized approach that assigns learning agents locally to each VIM, in conjunction with a module at the NFVO layer that coordinates the resulting local decisions, would alleviate many of these limitations. In particular, (i) the global and local intents would be naturally decoupled, (ii) each agent would act based on detailed local observations; thus, from the perspective of the overall system, the true state would be fully observable, and (iii) it would be easier to decouple action spaces from the physical topology. Along these lines, this work focuses on the investigation, the development, and the evaluation of a decentralized SFCP scheme based on cooperative deep multi-agent reinforcement learning (DMARL). Our main contributions are the following:

• We devise a DMARL framework for decentralized SFCP. Our DMARL algorithm consists of independent double deep Q-learning (DDQL) agents, and fosters the exchange of concise, yet critical, information among them.
nates of the SFC, while t expresses its arrival time, and ∆t denotes its lifespan.

We note that both src and dst are modeled as auxiliary VNF nodes with zero resource requirements (i.e., src, dst ∈ VR, and dR(src) = dR(dst) = 0). Naturally, this introduces two additional edges to the SFC graph, one from the src towards the first VNF, and one from the last VNF towards the dst (see top of Fig. 3). Further, we assume that the src and dst of an SFC correspond to coordinates of arbitrary PoPs, denoted by usrc and udst ∈ VS.

Substrate network model. We model a substrate network with an undirected graph GS = (VS, ES), where VS and ES denote the sets of nodes and edges. Since we consider GS to be a PoP-level topology (i.e., a network of datacenters), VS corresponds to PoPs, and ES to inter-PoP links. dS,t(u) expresses the available CPU of node u at time t (i.e., the average available CPU of servers belonging to PoP u), and dS,t(u, v) denotes the available bandwidth of the edge (u, v) at time t (again, as percentages). Last, D(u, v) represents the propagation delay of a physical link (u, v).

III. SINGLE-AGENT REINFORCEMENT LEARNING
This section discusses single-agent RL optimization. We initially outline the common RL setting, and proceed with a short description of the state-of-the-art DDQL algorithm. Subsequently, we elaborate on our proposed DDQL approach for SFCP.

A. BACKGROUND
The typical RL setting considers a learning agent, an environment, a control task subject to optimization, and a discrete time horizon H = {1, 2, ...}, which can as well be infinite. At time t ∈ H, the agent observes the current state of the environment st ∈ S, with S indicating the entire set of possible states. Equipped with a set of actions X, the agent shall interact with the environment by choosing xt ∈ X. Then, the agent receives feedback regarding the quality of its action via a reward signal rt, and the environment transitions to the subsequent state st+1 ∈ S.

Conventionally, an RL task is modelled as a Markov decision process, defined by the tuple ⟨S, X, S1, T, R⟩. Here, S and X are as before. S1 ∈ P(S) denotes the starting state distribution, and T : S × X → P(S) is the state transition function, where P(·) denotes the set of probability distributions over a set. Finally, R : S × X × S → ℝ expresses the reward function, which is generally denoted as Rt instead of R(st, xt, st+1) for simplicity. Naturally, the aim of the agent is to maximize the accumulated discounted reward over H, i.e., Σ_{t∈H} γ^t Rt, where γ ∈ [0, 1) is a discount factor employed to prioritize immediate reward signals against those projected farther into the future. To achieve that, the agent needs to perform a sequence of actions based on a learned policy π, which is practically a function that maps states to probability distributions over actions, i.e., π : S → P(X), and the optimal policy is denoted as π∗.

An important concept here is the action-value function, which quantifies the agent's incentive of performing a specific action on a particular state:

Q(st, xt) := Et[ Σ_{k=0}^{∞} γ^k Rt+k | st, xt ],  (st, xt) ∈ S × X    (1)

Apparently, if the agent explores its action selection on the environment long enough such that Q-values represent ground-truth reward values, then finding the optimal policy becomes trivial. That is, π∗(st) = argmax_{x∈X} Q(st, x). However, Q-values are treated as estimates, since the assumption of perfect exploration is hardly ever realistic. To this end, the agent needs to balance exploration and exploitation, where the former shall strengthen its confidence in the Q estimates, and the latter will enable it to benefit from accumulated knowledge. The most common approach to achieve a favourable exploration-exploitation trade-off is via an ϵ-greedy action selection strategy. According to it, the agent performs a random action with probability ϵ, while with probability 1 − ϵ the action with the highest Q-value is selected. Given an initial ϵ, i.e., ϵ0, and a decay factor ϵdecay ∈ (0, 1), the agent can gradually shift from exploratory to exploitative:

ϵt = ϵ0 · (ϵdecay)^t    (2)

B. DOUBLE DEEP Q-LEARNING
A well-known method that leverages the notions above is Q-learning, which is a model-free (i.e., it does not intend to discover the transition function), off-policy (i.e., while following a specific policy, it assesses the quality of a different one) RL algorithm. Q-learning attempts to find optimal policies via a direct estimation of Q-values, which are maintained in a tabular format. Each time a state-action pair (st, xt) is visited, the respective Q-value is updated. An apparent restriction here is the lack of generalization capacity. That is, in problems with large state-action spaces, the risk of scarce Q-updates is high.

To overcome this limitation, Mnih et al. [13] propose a Q-function approximation scheme based on deep neural networks (NNs), commonly termed deep Q-learning (DQL). As per DQL, the learning capacity of the agent lies in a NN which approximates the Q-function by parameterizing it, i.e., Q(s, x) ≈ Q(s, x; θ) (where θ represents a vector of weights and biases of the NN). Ultimately, given a state as input, the NN computes a Q-value for every possible action. Then, it is up to an action selection strategy (e.g., ϵ-greedy) to choose a particular action.

Two mechanisms that contribute to the success of DQL are the replay memory and the coordination of an online NN and a target NN [21]. The replay memory is used to store state-action-reward-next-state transitions; with a proper sampling over it, we can subsequently train the online NN with uncorrelated data.
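To make the above concrete, the following minimal Python sketch illustrates ϵ-greedy action selection with the decay of Eq. (2) and a bounded replay memory as described above. It is an illustration only, not the authors' implementation, and all names are chosen for readability:

    import random
    from collections import deque

    class EpsilonGreedyAgent:
        # Illustrative sketch: epsilon-greedy selection (Eq. 2) plus a replay memory.

        def __init__(self, n_actions, eps0=1.0, eps_decay=0.999, memory_size=10_000):
            self.n_actions = n_actions
            self.eps = eps0                          # epsilon_0 in Eq. (2)
            self.eps_decay = eps_decay               # epsilon_decay in (0, 1)
            self.memory = deque(maxlen=memory_size)  # keeps only the latest transitions

        def select_action(self, q_values):
            # Random action with probability eps; otherwise the action with the highest Q-value.
            if random.random() < self.eps:
                action = random.randrange(self.n_actions)
            else:
                action = max(range(self.n_actions), key=lambda a: q_values[a])
            self.eps *= self.eps_decay               # epsilon_t = epsilon_0 * (epsilon_decay)^t
            return action

        def store(self, obs, action, reward, next_obs, done):
            # State-action-reward-next-state transitions, later sampled in mini-batches.
            self.memory.append((obs, action, reward, next_obs, done))

        def sample(self, batch_size=32):
            return random.sample(self.memory, batch_size)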
The interplay of the two NNs works as follows: initially, the target NN is an exact replica of the online NN, meaning that they share the same architecture, weights, and biases. However, it is only the online NN that is updated at every training step, while its weights and biases are copied to the target NN every δ training steps. The target NN (represented by θ−) is used to temporarily stabilize the target value which the online NN (represented by θ) tries to predict. Effectively, the training of the learning agent boils down to the minimization of Eq. (3), also known as the loss function. In the target term of Eq. (3), the greedy policy is evaluated by the online NN, whereas the greedy policy's value is evaluated by the target NN. As per [21], this reduces the overestimation of Q-values, which would be the case if both the greedy policy and its value had been evaluated by the online NN. An illustration of the DDQL framework is given in Fig. 2.

L(θt) = E[ ( rt + γ Q(st+1, argmax_{x∈X} Q(st+1, x; θt); θt−) − Q(st, xt; θt) )² ]    (3)

where the first term inside the parentheses is the target and Q(st, xt; θt) is the prediction.

FIGURE 2: Illustration of the typical DDQL architecture.

C. DDQL FOR SFCP
We devise a single-agent RL algorithm for SFCP to serve as a baseline for comparison with our multi-agent RL scheme. Specifically, we opt for the DDQL method, as it has demonstrated promising results within our problem space (e.g., [7], [11], [22]). In our implementation, a training episode consists of the placement of an entire SFC, whereas a decision step refers to the placement of a single VNF of the SFC.

Observations. At time t, the agent observes information ot about the current VNF i and the SFC GR that it belongs to, as well as information about the available resources of the substrate physical topology GS. In particular, the agent receives the length of the SFC (|VR|), the coordinates of the src (src.loc) and dst (dst.loc) nodes of the SFC, the CPU demand of VNF i (dR(i)), the ingress (dR(i−, i)) and egress (dR(i, i+)) bandwidth requirements of VNF i, as well as its order (i.order) in the SFC. Regarding the physical network, the agent observes the coordinates (u.loc) and the average available CPU (dS,t(u)) of each PoP u ∈ VS, and the available bandwidth (dS,t(u, v)) of each link (u, v) ∈ ES. That is:

ot = ( |VR|, src.loc, dst.loc, dR(i), dR(i−, i), dR(i, i+), i.order, (u.loc, dS,t(u), ∀u ∈ VS), (dS,t(u, v), ∀(u, v) ∈ ES) )

Notice that, since ot does not hold information about the individual servers of the topology (which are the elements that actually host VNFs), the environment is partially observable, hence we use observation ot instead of state st.

Actions. In our centralized implementation, an action determines which PoP will host the current VNF i. Concretely, the agent's action set is X = {1, 2, ..., |VS|}. Notice that, if decision steps referred to the placement of an entire SFC instead of individual VNFs, then the action space would have been substantially larger (i.e., |X| = |VS|^|VR|). This would have an adverse effect on the algorithm's performance.

Reward. A VNF placement is deemed successful if the selected PoP has at least one server with adequate resources to host it. An SFC placement (which is the ultimate goal) is deemed successful if all of its VNFs have been successfully placed and all of its virtual links are assigned onto physical paths that connect the VNFs correctly. For every successful VNF placement, the agent receives rt = 0.1. If the current VNF is the last VNF of the chain (i.e., terminal VNF) and is successfully placed, then the virtual link placement commences (using Dijkstra's shortest path algorithm for every adjacent VNF pair, similar to [6] and [10]). If this process is successfully completed, the reward is computed as follows:

rt = 10 · (|opt_path| + 1) / (|act_path| + 1)    (4)

where |opt_path| is the length of the shortest path between the src.loc and the dst.loc (recall that these are always associated with PoP locations), and |act_path| is the length of the actual path established by the DDQL algorithm. Apparently, the best reward rt = 10 is given when |opt_path| = |act_path|. In case the placement of any VNF or virtual link fails (due to inadequate physical resources), then rt = −10 and a new training episode initiates.

Architecture. As hinted by previous discussions, the functionality of our DDQL agent relies on four key elements, namely an online NN, a target NN, a replay memory, and an action selection strategy. Both NNs comprise an input layer whose size equals the length of the observation vector ot (i.e., |ot| = 3|VS| + |ES| + 9), two fully-connected hidden layers with 256 neurons each, and an output layer with |VS| neurons. All layers are feed-forward, the activation function applied on individual cells is the Rectified Linear Unit (ReLU), and the loss function used is Eq. (3). The replay memory is implemented as a queue, which stores the latest 10,000 transitions.
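To make the training step concrete, the following PyTorch-style sketch shows how the loss of Eq. (3) could be computed with the online and target NNs described above (two hidden layers of 256 ReLU units). This is a minimal sketch, not the authors' code; function and variable names are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def build_q_net(obs_dim: int, n_actions: int) -> nn.Module:
        # Input layer of size |o_t|, two hidden layers with 256 ReLU units,
        # and an output layer with one Q-value per action (one per PoP in the centralized case).
        return nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def ddql_loss(online, target, batch, gamma=0.99):
        obs, actions, rewards, next_obs, done = batch  # mini-batch sampled from the replay memory
        # Prediction: Q(s_t, x_t; theta)
        q_pred = online(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # Greedy action selected by the online NN: argmax_x Q(s_{t+1}, x; theta) ...
            greedy = online(next_obs).argmax(dim=1, keepdim=True)
            # ... but evaluated by the target NN (theta^-), as in Eq. (3).
            q_next = target(next_obs).gather(1, greedy).squeeze(1)
            y = rewards + gamma * (1.0 - done) * q_next
        return F.mse_loss(q_pred, y)

Every δ training steps, the target NN would be re-synchronized with the online NN, e.g., via target.load_state_dict(online.state_dict()).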
The approach where each agent learns its own Q-function independently, while treating all other agents as part of the environment, is known as independent Q-learning (IQL) [23]. A major limitation of IQL is its lack of convergence guarantees, primarily stemming from the fact that the environment is non-stationary from the perspective of each agent. Effectively, this means that for an agent α, both its reward rt and its next observation oα_{t+1} are not solely conditioned on its current observation oα_t and action xα_t. In other words, the transitions of the environment, and, by extension, the common rewards, are affected by the actions of other agents as well. Even though obtaining an optimized joint policy π while agents treat other agents as part of the environment is seemingly difficult, IQL exhibits good empirical performance [24].

Cooperative MARL is an active research field, given that numerous control tasks can be seen through the lens of multi-agent systems. In the context of collaborative DQL agents, recent works (e.g., [25] and [26]) have established frameworks that promote joint training of multiple agents (i.e., centralized training - decentralized execution), under the assumption that the Qtotal-function of the multi-agent system can be decomposed into individual Q-functions such that, if each agent α maximizes its own Qα, then Qtotal is also maximized.

Irrespective of the promising advances in the field, our work builds upon typical IQL for the following reasons. First, IQL is simple enough to facilitate the interpretation of the system's learning behavior, as it avoids complex NN training schemes. Second, as is apparent in Section IV-C, we envisage an action coordination module at the NFVO layer to handle the constraints of SFCP, which can significantly benefit the overall IQL framework.

C. DMARL FOR SFCP
We implement a DMARL scheme for SFCP. Specifically, we generate a single DDQL agent (as described in Section III-B) for every PoP u in the substrate network GS. For the rest of this section, we assume that agent α is associated with PoP u, and agent β with PoP v.

Observations. At time t, each agent α ∈ A observes information oα_t about the current VNF i and the SFC GR to which this VNF belongs, as well as the available resources of the substrate physical topology GS. In detail, agent α receives the length of the SFC, the coordinates of the src and dst nodes of the SFC, the CPU demand of VNF i, the ingress and egress bandwidth requirements of VNF i, and the order of this VNF in the SFC, similar to the single-agent observation mentioned in Section III-C. Regarding the physical network, agent α observes the coordinates of PoP u, the available CPU (dS,t(us)) of every server s ∈ u, and the available bandwidth of each physical link connected to PoP u.

We further augment the observation space of agents by enabling cross-agent communications. Specifically, we define N(α) to be the set of neighboring agents of α, i.e., N(α) = {β : distance(α, β) = 1, ∀β ∈ A − {α}}, where distance(α, β) refers to the length of the shortest path between PoPs u and v in GS (recall that α operates over u and β over v). Effectively, α receives the location coordinates and the average CPU capacity from every PoP v managed by a neighboring agent β ∈ N(α). As it will become apparent, cross-agent communications are pivotal for the efficiency of the system, since they enable agents to reason about the state of other agents. Yet, to keep the communication overhead low, agents do not share their entire local state (i.e., the available CPU of each server, dS,t(us), ∀s ∈ u), but rather a single descriptive value (i.e., the average available CPU across all servers, dS,t(u)).

As explained in Section IV-B, agents within MARL settings operate over non-stationary environments, which practically implies that old experiences might become highly irrelevant as agents shift from exploratory to exploitative. To this end, we include a fingerprint [27] into the observation, namely ϵ (the probability to select a random action), as a means to distinguish old from recent experiences.

The last elements that are inserted into oα_t are three binary flags, namely hosts_another, hosts_previous, and in_shortest_path, which are computed by agent α prior to action selection. In particular, hosts_another = 1 if any VNF of the current SFC has been placed in PoP u, hosts_previous = 1 if the previous VNF of the current VNF has been placed in PoP u, and in_shortest_path = 1 if PoP u is part of the shortest path between the src and the dst of the SFC, while they are zero otherwise. These three flags will be proven crucial for analyzing the behavior of the DMARL system in Section V-C. As such, we define:

oα_t = ( |VR|, src.loc, dst.loc,                                   [SFC state]
         dR(i), dR(i−, i), dR(i, i+), i.order,                     [VNF state]
         u.loc, (dS,t(us), ∀s ∈ u), (dS,t(u, v), ∀v ∈ N(u)), ϵ,    [local state (resources and fingerprint)]
         hosts_another, hosts_previous, in_shortest_path,          [local state (flags)]
         (v.loc, dS,t(v), ∀v ∈ N(u)) )                             [neighbors' condensed state]

where N(u) denotes the neighboring nodes of u in the graph GS. Note that, contrary to the single-agent case, here the team of agents observes the true state of the environment, i.e., ∪_{α∈A} oα_t = st. However, from the point of view of a single agent, the environment is still partially observable.

Actions. Agents are equipped with a set of actions that allows them to express their intents based on their local state. In detail, all agents share the same discrete set of actions X = {0, 1, 2}, where 0 indicates low willingness, 1 expresses a neutral position, and 2 indicates high willingness, with respect to hosting the current VNF. Note that the above action sets are topology-agnostic (in contrast to the action set of our single-agent DDQL, where there is one action for each PoP).
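For illustration, the observation above can be pictured as a flat vector assembled from the SFC, VNF, local, and neighbor fields. The following sketch is a hypothetical helper (the attribute names and 2-D coordinates are assumptions, not the authors' code):

    def build_observation(sfc, vnf, pop, neighbors, epsilon):
        # Illustrative assembly of o^alpha_t for the agent operating over PoP u (`pop`).
        obs = [
            sfc.length,                                  # |V_R|
            *sfc.src_loc, *sfc.dst_loc,                  # src.loc, dst.loc
            vnf.cpu_demand,                              # d_R(i)
            vnf.bw_in, vnf.bw_out,                       # d_R(i-, i), d_R(i, i+)
            vnf.order,                                   # i.order
            *pop.loc,                                    # u.loc
            *pop.server_cpu,                             # d_{S,t}(u_s) for every server s in u
            *pop.link_bw,                                # d_{S,t}(u, v) for every adjacent link
            epsilon,                                     # fingerprint that time-stamps the experience
            int(pop.hosts_any_vnf_of(sfc)),              # hosts_another
            int(pop.hosts_previous_vnf_of(vnf)),         # hosts_previous
            int(pop.on_shortest_path(sfc.src_loc, sfc.dst_loc)),  # in_shortest_path
        ]
        # Neighbors' condensed state: only location and average available CPU are exchanged,
        # which keeps the cross-agent messages short.
        for nb in neighbors:
            obs.extend([*nb.loc, nb.avg_available_cpu])
        return obs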
FIGURE 6: Action selection rules identified by DMARL agents. (a) Agent 0; (b) Agent 1; (c) Agent 2; (d) Agent 3; (e) Agent 4.

that of DDQL. According to Fig. 5b, the rejection rates are negligible for both algorithms.

Although informative, Figs. 5a and 5b do not convey a lot about the differences of the underlying algorithmic behaviors. To this end, we also consider Fig. 5c, which depicts the footprint of optimal SFCPs. The most evident difference here is centered on the incapacity of DDQL to achieve many 1-PoP partitionings, which is counterbalanced by its higher capacity to achieve >2-PoP partitionings, compared to DMARL. Our interpretation of this behavior is as follows: DDQL has a partial view of the real state, but its observation contains information about the entire topology (see Section III-C). This enhances its inherent capacity to distribute VNFs across many PoPs, since it can reason about the state of the entire multi-datacenter system. At the same time, its observations only include the average available CPU capacities of each PoP (i.e., dS,t(u) in ot), which limits its confidence in consolidating multiple VNFs into the same PoP. For instance, if dS,t(u) = 0.15 for the PoP u ∈ VS, and the current VNF i demands dR(i) = 0.18, then it is impossible for DDQL to be certain whether i fits into a server of PoP u. Conversely, a DMARL agent α operating over u has a detailed view of the available CPU resources of all servers in PoP u, e.g., α observes (dS,t(u1) = 0.10, ..., dS,t(us) = 0.20, ..., dS,t(u10) = 0.15), and, in this case, it can be certain that server s of PoP u can accommodate VNF i.

Although the aforementioned limitation can be mitigated by extending DDQL's observation space with additional metrics (at the risk of further slowing its convergence), we deem that this analysis indicates the implications of decision-making with incomplete information in the context of SFCP.

C. DMARL ANALYSIS
To further delve into the behavior developed by the DMARL scheme, we employ Fig. 6. Here, we attempt to shed light on the learning aspect of the team of agents in the mesh topology, by examining certain action selection rules that have been identified over the course of training. To this end, we monitor the evolution of p(x|f1, f2, f3) at every training step, which shall be interpreted as the probability of taking action x ∈ X = {0, 1, 2}, given that f1 = hosts_another, f2 = hosts_previous, and f3 = in_shortest_path (recall that these are binary flags, defined in Section IV-C, and are part of the observation oα_t).

As per Figs. 6a - 6e, every agent manages to discover four key rules which, from a human perspective, seem rather intuitive for the SFCP problem. Specifically, the increase of p(0|0, 0, 0) implies that the agents tend to express low willingness for hosting the current VNF when all flags are down, i.e., no other VNFs are placed in the associated PoP (hosts_another = 0), the previous VNF is placed in another PoP (hosts_previous = 0), and the associated PoP is not in the shortest path from the src to the dst (in_shortest_path = 0). In contrast, a growing p(2|1, 1, 1) means that agents tend to express high willingness to host the current VNF when all flags are up. At the same time, the counter-intuitive p(2|0, 0, 0) (high willingness when all flags are down) and p(0|1, 1, 1) (low willingness when all flags are up) seem to decrease over time. Our argument that these rules are indeed learned is backed by the fact that the respective probabilities diverge from the horizontal dotted line at p = 0.33, which indicates random action selection. Nevertheless, these lines never reach either 1.0 or 0.0, which may imply that the additional observation dimensions (besides the three flags) also play an important role in the action selection.
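As an illustration of how such rules can be monitored, the sketch below estimates p(x|f1, f2, f3) from logged (flags, action) pairs over a window of recent decision steps. It is a hypothetical monitoring utility, not part of the authors' implementation:

    from collections import Counter, deque

    class RuleMonitor:
        # Tracks p(x | f1, f2, f3) over the most recent decision steps of a single agent.

        def __init__(self, window=5000):
            self.history = deque(maxlen=window)      # (flags, action) pairs

        def record(self, flags, action):
            # flags: (hosts_another, hosts_previous, in_shortest_path); action in {0, 1, 2}
            self.history.append((tuple(flags), action))

        def prob(self, action, flags):
            # Empirical estimate of p(action | flags); 0.33 would indicate random selection.
            matching = [a for f, a in self.history if f == tuple(flags)]
            if not matching:
                return None
            return Counter(matching)[action] / len(matching)

For instance, monitor.prob(0, (0, 0, 0)) approximates p(0|0, 0, 0) over the current window.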
FIGURE 7: Cross-agent communication in the mesh topology. (a) Learning progress; (b) Optimal and rejected SFCs; (c) Agent 1; (d) Agent 2.

FIGURE 8: Cross-agent communication in the star topology. (a) Learning progress; (b) Optimal and rejected SFCs; (c) Behavior of Agent 0.

FIGURE 9: Cross-agent communication in the linear topology. (a) Learning progress; (b) Optimal and rejected SFCs; (c) Action selection (com. off); (d) Action selection (com. on).

D. CROSS-AGENT COMMUNICATION
We now evaluate the impact of cross-agent communication on the overall efficiency of the DMARL scheme across all three PoP topologies shown in Fig. 4. Therefore, we implement a variant of the DMARL algorithm, where agents do not receive the neighbors' condensed state (i.e., location and average available CPU in oα_t).

Fig. 7 summarizes our findings for the mesh topology (Fig. 4a). According to Figs. 7a and 7b, the DMARL scheme with the communication feature on dominates the respective scheme with communication off. In particular, the former manages to converge faster towards an average score of 7.0, and its rate of optimal partitionings grows slightly quicker, compared to the latter. To interpret this performance difference, we lay out Figs. 7c and 7d for DMARL with communication off, meant to be compared with Figs. 6b and 6c respectively, where DMARL employs cross-agent communication. We focus on Agent 1 and Agent 2, since they operate over two pivotal PoPs in the mesh topology (i.e., PoPs 1 and 2 are highly relevant for SFCs). From the comparison of the respective p(x|f1, f2, f3) lines, we observe that, when cross-agent communication is disabled, action selection rules are learned both slower and less firmly.

Results from the respective comparison in the star topology (Fig. 4b) are depicted in Fig. 8. In particular, Figs. 8a and 8b emphasize the superiority of the DMARL algorithm that uses cross-agent communication. Here, the former converges both faster and to a higher average reward, and the difference in optimal partitionings is close to 10%. It is worth noting that we observed slower and less firm identification of action selection rules in this case also, but the respective results are omitted due to space limitations. Instead, we plot Fig. 8c, where we zoom into a behavioral aspect of Agent 0 (since PoP 0 is a pivotal PoP in the star topology). Specifically, p(0|0, 0, 1) is lower for Agent 0 when cross-agent communication is enabled, meaning that this agent is still more likely to select actions 1 or 2 when it is in the shortest path from the src to the dst, even if it does not host another VNF or the previous VNF. Note that the respective line for DMARL with communication off is closer to random. This corroborates the fact that enabled communication augments Agent 0's ability to perceive its position in the star topology and, hence, to potentially facilitate SFCPs as an intermediate PoP.

Finally, we assess the cross-agent communication feature in the linear PoP topology shown in Fig. 4c. Here, results regarding the learning progress (Fig. 9a) and the rate of optimal and rejected SFCPs (Fig. 9b) follow a similar trend to that of the star topology, and the benefits of cross-agent communication remain. Some insightful micro-benchmarks regarding the difference of the two schemes are illustrated in Figs. 9c and 9d. Specifically, these bar charts indicate action selection frequencies per agent during the last 2,000 episodes (where agents are essentially purely exploitative, as ϵ → 0+). Based on Fig. 9c, Agent 1 shows the most willingness to host VNFs; however, this contradicts the linear topology, in which we would expect Agent 2 to participate in more SFCPs. Therefore, Agent 2 cannot perceive its pivotal position in the linear topology when the communication is off. In contrast, in Fig. 9d, notice that action 2 frequencies are better aligned with the position of agents in the linear topology. Specifically, we observe that Agent 2 is the most willing to allocate its PoP resources to VNFs, followed by its adjacent agents 1 and 3, while agents 0 and 4 participate in fewer SFCPs, given their outermost position in the linear topology. We also note that, in principle, when communication is disabled, agents tend to select action 1 (i.e., a neutral position with respect to hosting a VNF) more frequently, which can be interpreted as a lack of confidence in their learning capacity.

According to the analysis above, cross-agent communication of short, yet informative, messages (i) enables faster and firmer identification of action selection rules, and (ii) enhances the agents' ability to identify their position within the topology and adjust their action selection behavior accordingly.
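One way to realize the communication-off variant evaluated above is to withhold the neighbors' condensed state when each agent's observation is assembled. The helper below is an illustrative sketch of such an ablation (the field layout assumes 2-D coordinates plus one average-CPU value per neighbor, which is an assumption rather than the authors' exact encoding):

    def mask_neighbor_state(obs, n_neighbors, fields_per_neighbor=3):
        # Ablation helper: zero out the neighbors' condensed state (location + average CPU)
        # at the tail of the observation vector, emulating "communication off".
        masked = list(obs)
        k = n_neighbors * fields_per_neighbor
        if k > 0:
            masked[-k:] = [0.0] * k
        return masked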
and, subsequently, utilizes Dijkstra's method to compute link assignments.

Pei et al. [7] propose an SFCP framework which leverages DDQL. Here, the decision-making process is subdivided into three steps. First, given a network state, a preliminary evaluation of each action is performed, i.e., computation of the respective Q-values. Out of all actions, all but the k best are discarded. In the second step, these k actions are executed in simulation mode, and the actual reward and next states are observed. Finally, the action with the highest reward is performed on the physical infrastructure.

Zheng et al. [8] investigate the deployment of SFCs in the context of the cellular core. To this end, the authors put both the intra-datacenter and the intra-server placement problem into the frame. Focusing our discussion on the intra-datacenter placement, the authors devise an approximation algorithm for optimized SFCP, under the assumption that the resource requirements and the lifespan of each SFC are fully disclosed. However, given the strictness of this hypothesis, the authors implement an SFCP scheme that can cope with demand uncertainty as well. In particular, the proposed method is conventional Q-learning. Here, a state st is expressed as the SFC type that needs to be deployed at time t, while an action xt represents the selection of a server. Nonetheless, conventional Q-learning is not a viable option when the state-action space is vast.

Wang et al. [11] and Jia et al. [12] investigate SFCP through the lens of fault tolerance, essentially considering the deployment of redundant VNFs or entire SFCs which shall be engaged in the event of a malfunction in the running SFC. In particular, Wang et al. [11] devise a DDQL algorithm that is trained to compute an ordered pair (u, v), where u is the index of the DC for the proper deployment, and v the backup DC where the standby SFC instance will be deployed. Effectively, this implies that entire SFCs will be placed in a single PoP. Jia et al. [12] decompose the SFC scheduling problem with reliability constraints into two sub-problems. First, they use a heuristic to determine the number of redundant VNF instances (per VNF). Then, they employ an A3C algorithm which, in conjunction with a rule-based node selection approach, facilitates the process of mapping these instances onto compute nodes. In practice, the DRL algorithm learns whether or not to defer the deployment of a VNF, instead of learning where to map it. However, given such a topology-independent action space, the proposed DRL scheme is not sensitive to evolving topologies.

In contrast to the above, our work promotes single-domain SFCP in a decentralized, yet coordinated fashion, addressing various limitations of existing studies. First, the proposed DMARL scheme empowers local controllers (i.e., VIMs) with the ability to express their own resource utilization intents based on detailed local observations (as opposed to, e.g., [4]–[12]). This is explicitly achieved by designating one DDQL agent per VIM, and by utilizing action sets that express the degree of willingness with respect to hosting VNFs. Second, our DMARL scheme can indeed claim decision-making with complete information, since, from the perspective of the team of agents, the environment is fully observable. This is not the case with many studies, which either define incomplete representations of the true state or assume very detailed knowledge at the orchestration layer (e.g., [4], [6]–[8], [10], [11]). Third, the action spaces used by individual DDQL agents of the DMARL framework are topology-agnostic (as opposed to, e.g., [4]–[11]), which alleviates scalability limitations and renders our framework a promising candidate for dynamic topologies.

B. RL IN MULTI-DOMAIN SFCP
Multi-domain SFCP is commonly solved via distributed methods [17]. This can be attributed to the limited information sharing among different providers, which hinders the possibility of solving SFCP in a centralized fashion.

In [28], the authors assign a DDPG-based RL agent to each domain. Further, the SFC request is encoded by the client (i.e., the entity that wants to deploy the SFC) and conveyed to each RL agent. The latter then treat the SFC encoding as state and compute an action, which is effectively the bidding price for renting out their resources to the SFC request. Prices are accumulated by the client, who opts for the best combination based on a cost-based first-fit heuristic. The MARL setting here is inherently competitive, as agents do not have any incentive to maximize the rewards of other agents.

Toumi et al. [22] propose a hierarchical MARL scheme which employs a centralized agent (i.e., a multi-domain orchestrator - MDO) on top of multiple local-domain agents, where all agents are implemented with the DDQL architecture. The MDO receives the SFC request, and decides which local domain will host each constituent VNF. Afterwards, the agents of local domains are responsible for placing the sub-SFCs within their own nodes. Admittedly, the proposed MDO is quite similar to our own DDQL implementation for SFCP. If we assume that the partial observation of our DDQL due to data aggregation is equivalent to the partial observation of the MDO due to limited information sharing across multiple domains, then we can think of the comparison in Section V-B as a comparison of DMARL with MDO⁵.

In [29], Zhu et al. devise a MARL scheme for SFCP over IoT networks. Specifically, they propose a hybrid architecture that employs both centralized training and distributed execution strategies. Here, the SFCP problem is modeled as a multi-user competition game (i.e., a Markov game) to account for users' competitive behavior. In their implementation, the centralized controller performs global information statistical learning, while each user deploys service chains in a distributed manner, guided by a critic network.

⁵ We acknowledge that these two sources of partial observability are not equivalent, hence a direct comparison of DMARL with MDO makes little sense.
While multi-domain SFCP is vertical to single-domain SFCP on the grounds of information disclosure, the related solutions exhibit some similarities to the proposed DMARL scheme. In effect, agents voting on hosting VNFs can be paralleled to agents bidding for resources, while our coordination module can be seen as a centralized broker which aggregates votes and computes the best action. However, the intrinsic competition that underlies multi-domain scenarios cannot be overlooked. For example, notice how agents in multi-domain SFCP strive to maximize their own rewards, disregarding the rewards of others. As such, our DMARL algorithm cannot be directly compared to any solution pertaining to the above.

VII. DISCUSSION
Our work exhibits certain shortcomings, which we deem important to discuss and clarify. With respect to the underlying algorithmic framework we opted for, which is IQL, we have already acknowledged its lack of convergence guarantees. The natural next step is to examine a DMARL algorithm for SFCP based on centralized training - decentralized execution schemes, such as the ones proposed in [25] and [26], where finding the optimal joint policy is theoretically guaranteed (under certain assumptions). Additional limitations which we identify mostly pertain to the analysis of our method, rather than to the method itself. In particular, we evaluate DMARL over relatively small multi-PoP topologies. Yet, the intention of this work is to explore in depth the learning behavior that is developed by a team of independent RL agents in order to cooperatively solve SFCP. To this end, keeping the topologies small not only allows us to delve easier into the behavior of individual agents, but also to demonstrate the respective findings (e.g., Fig. 6).

VIII. CONCLUSIONS
Irrespective of the above constraints, both IQL and our evaluation methodology enable us to obtain valuable knowledge. Specifically, the comparison in Section V-B indicates that DMARL outperforms a centralized state-of-the-art SFCP method based on DDQL, which we mainly attribute to the full observability of the former against the partial observability of the latter. Concretely, DDQL fails to consolidate VNFs to the extent DMARL does, given the fact that it is not aware of the exact server capacities. We further corroborate that the team of agents manages to recognize certain intuitive action selection rules in Section V-C. This is highly crucial, especially considering efforts towards machine-learning explainability. Moreover, we identify that disabling the exchange of concise messages among neighboring agents limits their ability to perceive their position within the multi-PoP system, which has a negative effect on the developed action selection behaviors (Section V-D). Last, we quantify the performance drop of DMARL over multi-PoP systems which exhibit non-zero packet loss rates (Section V-E). The key takeaway here is that it is highly important for DMARL agents to at least be aware of the position of their neighbors, which implies that location coordinates play an important role in the individual action selection policies.

While this work sheds light on numerous fundamental aspects of distributed resource allocation within NFV, at the same time it paves the way for several new research directions. For instance, having established the perks of decentralized SFCP, a natural next step is to examine a multi-agent RL algorithm in which agents shall learn what, when, to whom, and how to communicate; in other words, agents shall learn their own cross-agent communication protocol.

REFERENCES
[1] M. Rost and S. Schmid, "On the hardness and inapproximability of virtual network embeddings," IEEE/ACM Transactions on Networking, vol. 28, no. 2, pp. 791–803, 2020.
[2] D. Dietrich, A. Abujoda, A. Rizk, and P. Papadimitriou, "Multi-provider service chain embedding with Nestor," IEEE Transactions on Network and Service Management, vol. 14, no. 1, pp. 91–105, 2017.
[3] A. Pentelas, G. Papathanail, I. Fotoglou, and P. Papadimitriou, "Network service embedding across multiple resource dimensions," IEEE Transactions on Network and Service Management, vol. 18, no. 1, pp. 209–223, 2020.
[4] S. Haeri and L. Trajković, "Virtual network embedding via Monte Carlo tree search," IEEE Transactions on Cybernetics, vol. 48, no. 2, pp. 510–521, 2017.
[5] Y. Xiao, Q. Zhang, F. Liu, J. Wang, M. Zhao, Z. Zhang, and J. Zhang, "NFVdeep: Adaptive online service function chain deployment with deep reinforcement learning," in Proceedings of the International Symposium on Quality of Service, 2019, pp. 1–10.
[6] P. T. A. Quang, Y. Hadjadj-Aoul, and A. Outtagarts, "A deep reinforcement learning approach for VNF forwarding graph embedding," IEEE Transactions on Network and Service Management, vol. 16, no. 4, pp. 1318–1331, 2019.
[7] J. Pei, P. Hong, M. Pan, J. Liu, and J. Zhou, "Optimal VNF placement via deep reinforcement learning in SDN/NFV-enabled networks," IEEE Journal on Selected Areas in Communications, vol. 38, no. 2, pp. 263–278, 2019.
[8] J. Zheng, C. Tian, H. Dai, Q. Ma, W. Zhang, G. Chen, and G. Zhang, "Optimizing NFV chain deployment in software-defined cellular core," IEEE Journal on Selected Areas in Communications, vol. 38, no. 2, pp. 248–262, 2019.
[9] R. Solozabal, J. Ceberio, A. Sanchoyerto, L. Zabala, B. Blanco, and F. Liberal, "Virtual network function placement optimization with deep reinforcement learning," IEEE Journal on Selected Areas in Communications, vol. 38, no. 2, pp. 292–303, 2019.
[10] Z. Yan, J. Ge, Y. Wu, L. Li, and T. Li, "Automatic virtual network embedding: A deep reinforcement learning approach with graph convolutional networks," IEEE Journal on Selected Areas in Communications, vol. 38, no. 6, pp. 1040–1057, 2020.
[11] L. Wang, W. Mao, J. Zhao, and Y. Xu, "DDQP: A double deep Q-learning approach to online fault-tolerant SFC placement," IEEE Transactions on Network and Service Management, vol. 18, no. 1, pp. 118–132, 2021.
[12] J. Jia, L. Yang, and J. Cao, "Reliability-aware dynamic service chain scheduling in 5G networks based on reinforcement learning," in IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. IEEE, 2021, pp. 1–10.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[14] D. Dietrich, A. Rizk, and P. Papadimitriou, "Multi-domain virtual network embedding with limited information disclosure," in 2013 IFIP Networking Conference. IEEE, 2013, pp. 1–9.
[15] P. T. A. Quang, A. Bradai, K. D. Singh, and Y. Hadjadj-Aoul, "Multi-domain non-cooperative VNF-FG embedding: A deep reinforcement learning approach," in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2019, pp. 886–891.
[16] N. Toumi, M. Bagaa, and A. Ksentini, "Hierarchical multi-agent deep reinforcement learning for SFC placement on multiple domains," in 2021 IEEE 46th Conference on Local Computer Networks (LCN). IEEE, 2021, pp. 299–304.
[17] J. C. Cisneros, S. Yangui, S. E. P. Hernández, and K. Drira, "A survey on distributed NFV multi-domain orchestration from an algorithmic functional perspective," IEEE Communications Magazine, vol. 60, no. 8, pp. 60–65, 2022.
[18] F. Carpio, S. Dhahri, and A. Jukan, "VNF placement with replication for load balancing in NFV networks," in 2017 IEEE International Conference on Communications (ICC). IEEE, 2017, pp. 1–6.
[19] A. Pentelas and P. Papadimitriou, "Network service embedding for cross-service communication," in 2021 IFIP/IEEE International Symposium on Integrated Network Management (IM). IEEE, 2021, pp. 424–430.
[20] ——, "Service function chain graph transformation for enhanced resource efficiency in NFV," in 2021 IFIP Networking Conference (IFIP Networking). IEEE, 2021, pp. 1–9.
[21] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
[22] N. Toumi, M. Bagaa, and A. Ksentini, "Hierarchical multi-agent deep reinforcement learning for SFC placement on multiple domains," in 2021 IEEE 46th Conference on Local Computer Networks (LCN). IEEE, 2021, pp. 299–304.
[23] M. Tan, "Multi-agent reinforcement learning: Independent vs. cooperative agents," in Proceedings of the Tenth International Conference on Machine Learning, 1993, pp. 330–337.
[24] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, "Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems," The Knowledge Engineering Review, vol. 27, no. 1, pp. 1–31, 2012.
[25] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls et al., "Value-decomposition networks for cooperative multi-agent learning," arXiv preprint arXiv:1706.05296, 2017.
[26] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson, "QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning," in International Conference on Machine Learning. PMLR, 2018, pp. 4295–4304.
[27] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson, "Stabilising experience replay for deep multi-agent reinforcement learning," in International Conference on Machine Learning. PMLR, 2017, pp. 1146–1155.
[28] P. T. A. Quang, A. Bradai, K. D. Singh, and Y. Hadjadj-Aoul, "Multi-domain non-cooperative VNF-FG embedding: A deep reinforcement learning approach," in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2019, pp. 886–891.
[29] Y. Zhu, H. Yao, T. Mai, W. He, N. Zhang, and M. Guizani, "Multiagent reinforcement-learning-aided service function chain deployment for Internet of Things," IEEE Internet of Things Journal, vol. 9, no. 17, pp. 15674–15684, 2022.

Angelos Pentelas (Student Member, IEEE) obtained the B.Sc. in Mathematics from the Aristotle University of Thessaloniki, Greece, and the M.Sc. in Applied Informatics from the University of Macedonia, Greece. Currently, he is pursuing the Ph.D. degree at the same department. From Oct '21 to Mar '22, he was with Nokia Bell Labs, Belgium, as a Ph.D. Intern. His research is centered on decision-making methods for network orchestration.

Danny De Vleeschauwer (Member, IEEE) received the M.Sc. degree in electrical engineering and the Ph.D. degree in applied sciences from Ghent University, Belgium, in 1985 and 1993, respectively. He currently is a DMTS in the Network Automation Department of the Network Systems and Security Research Lab of Nokia Bell Labs, in Antwerp, Belgium. Prior to joining Nokia, he was a Researcher at Ghent University. His early work was on image processing and on the application of queuing theory in packet-based networks. His current research interest includes the distributed control of applications over packet-based networks.

Chia-Yu Chang (Member, IEEE) received the Ph.D. degree from Sorbonne Université, France, in 2018. He is currently a Network System Researcher at Nokia Bell Labs. He has more than 12 years of experience in algorithm/protocol research on communication systems. He has extensive research experience at both academic and industrial laboratories, including the EURECOM Research Institute, MediaTek, the Huawei Swedish Research Center, and Nokia Bell Labs. He has participated in several European Union collaborative research and innovation projects, such as COHERENT, SLICENET, 5G-PICTURE, 5Growth, and DAEMON. His research interests include wireless communication, computer networking, and edge computing, with particular interest in network slicing, traffic engineering, and AI/ML-supported network control.

Koen De Schepper received the M.Sc. degree in industrial sciences (electronics and software engineering) from IHAM Antwerpen, Belgium. He joined Nokia (then Alcatel) in 1990, where during the first 18 years he was a Platform Development Leader and Systems Architect. He has been with Bell Labs for the past 14 years, currently in the Network Systems and Security Research Laboratory. Before, he worked mainly on transport layer protocols (L4) and their network support for scalable (SCAP) and low latency (L4S) content delivery. His current research interests include programmable data plane and traffic management, customizable network slicing, and AI-supported dynamic service and network control.

Panagiotis Papadimitriou (Senior Member, IEEE) is an Associate Professor at the Department of Applied Informatics in the University of Macedonia, Greece. Before that, he was an Assistant Professor at the Communications Technology Institute of Leibniz Universität Hannover, Germany, and a member of the L3S research center in Hanover. Panagiotis received a Ph.D. in Electrical and Computer Engineering from Democritus University of Thrace, Greece, in 2008, an M.Sc. in Information Technology from the University of Nottingham, UK, in 2001, and a B.Sc. in Computer Science from the University of Crete, Greece, in 2000. He has been a (co-)PI in several EU-funded (e.g., NEPHELE, T-NOVA, CONFINE, NECOS) and nationally-funded projects (e.g., G-Lab VirtuRAMA, MESON). Panagiotis was a recipient of Best Paper Awards at IFIP WWIC 2012, IFIP WWIC 2016, and the runner-up Poster Award at ACM SIGCOMM 2009. He has co-chaired several international conferences and workshops, such as IFIP/IEEE CNSM 2022, IFIP/IEEE Networking TENSOR 2021–2020, IEEE NetSoft S4SI 2020, IEEE CNSM SR+SFC 2018–2019, IFIP WWIC 2017–2016, and INFOCOM SWFAN 2016. Panagiotis is also an Associate Editor of IEEE Transactions on Network and Service Management. His research activities include (next-generation) Internet architectures, network processing, programmable data planes, time-sensitive networking (TSN), and edge computing.