Neural Network Tomography
Abstract: Network tomography, a classic research problem in the realm of network monitoring, refers to the methodology of inferring unmeasured network attributes using selected end-to-end path measurements. In the research community, network tomography is generally investigated under the assumptions of known network topology, correlated path measurements, bounded number of faulty nodes/links, or even special network protocol support. The applicability of network tomography is considerably constrained by these strong assumptions, which

of ICMP/data packets. In risk-sensitive applications, security policies may even block such measurements.
Alternatively, the end-to-end approach provides a solution that does not require the cooperation of internal network elements or the equal treatment of control/data packets. It relies on end-to-end path performance metrics (e.g., path delays or bandwidths) experienced by data packets to infer the unknown network information using network tomography.
arXiv:2001.02942v1 [cs.NI] 9 Jan 2020
[35]. Finally, little is known about a solution that simultaneously addresses both additive and non-additive tomography problems.
In this paper, we establish a generic and lightweight tomography framework that removes all of the above assumptions and constraints, and is thus applicable to most practical network settings. The input to our tomography framework is only a set of end-to-end path measurements w.r.t. some node pairs, and the output is the predicted path performance metrics for all unmeasured node pairs. For each input data point, the only available information is the starting/terminating nodes and their corresponding path performance metric. The proposed framework, called NeuTomography, is based on deep neural networks [43], which learn the non-linear relationship between node pairs and their path performance metrics.
Compared to existing tomography solutions, NeuTomography is generic and easily applicable in that it does not require additional network knowledge or rely on a specific performance metric type (additive or non-additive). Moreover, since the given measured node pairs can potentially be any subset of all node pairs in the network, i.e., there may exist measurement bias, we further propose Path Augmented Tomography (PAT) that proactively constructs additional input data by estimating the performance bounds of unmeasured node pairs using the given path measurements. Extensive experiments via both Rocketfuel [44] and CAIDA's ITDK [45] network data show that by measuring only 30% of the node pairs, NeuTomography is able to accurately predict the path performance of the remaining 70% of the node pairs, with the mean absolute percentage error (MAPE) as small as 2%, and PAT further reduces MAPE by up to 50%; such results are orders of magnitude improvement over benchmarks. Finally, although we are not given any information about the network topology, NeuTomography provides a solution to efficiently reconstruct the network topology with different granularities utilizing only the given end-to-end node pair measurements, thus revealing more insights to network operators for resource optimizations.

A. Further Discussions on Related Work
Network tomography can be categorized into passive [46] and active tomography [8]. Passive tomography refers to a technology of inferring network performance metrics by passively observing the existing traffic attributes [47]. However, passive tomography requires additional assumptions, e.g., correlated performance metrics, to assist the inference task, thus limiting its applicability. In contrast, active tomography proactively measures some performance metrics; it is more useful in practical network monitoring tasks, and therefore is the focus of this paper.
For active tomography, the most important branch is identifying additive performance metrics. Under the assumption of known network topology and node/link involvement in each measurement path, the problem is formulated as solving a system of linear equations. Yet, even under such assumptions, it is frequently impossible to uniquely identify all unmeasured link metrics from path measurements because the linear system is not always invertible [7], [8], [9], and statistical approaches [40], [1], [48] and multicast [49], [50], [10], [39] are needed to estimate the link metric distributions. When all but k link metrics are zero, compressive sensing techniques are used to identify the k non-zero link metrics [51], [52], [47]. With additional assumptions of controllable routing, [53] derives necessary and sufficient conditions on the network topology for identifying all link metrics, given that monitors can measure any cycles. A similar study in [54] quantifies the minimum number of measurements needed to identify a broader set of link metrics. Moreover, [55], [56], [57] develop measuring vantage placement algorithms for performing efficient path measurements. Since routing along cycles is typically prohibited, these methods are not widely applicable. If only cycle-free paths can be measured, then [12], [13] establish the necessary and sufficient topological conditions for link metric determination. Yet, such cycle-free controllable source routing still has limited support in practice.
When the performance metrics are non-additive, additional constraints are typically imposed. Under the assumption that multiple simultaneous failures happen with negligible probability, [5], [58], [59], [60], [55], [57] aim to detect and localize the bottleneck in the network. To improve the resolution in characterizing failures, range tomography in [61] not only localizes the failure, but also estimates its severity (e.g., congestion level). These works, however, ignore the fact that multiple failures occur more frequently than one may imagine [62]. To address the issue of multiple failures, [41], [42], [63], [64], [65], [66] attempt to find the minimum set of network elements whose failures explain the observed path states. In a Bayesian formulation, [67], [68] estimate the failure probabilities of different links. For the case of binary performance metrics (failed or normal), if the number of failed links is upper bounded by k and the measuring vantages can probe arbitrary cycles or paths, [69], [70], [31] focus on placing measuring vantages and constructing measurement paths to localize a given number of failures. Furthermore, in [28], [29], [30], [32], [33], [34], [35], efficient testing conditions and algorithms are proposed to quantify the capability of localizing node failures. However, for arbitrary-valued non-additive link metrics, few positive results are known. In this regard, we build the neural-network-based tomography framework for such general non-additive performance metrics.

B. Summary of Contributions
Our contributions are four-fold:
1) We propose for the first time a deep neural-network-based generic and lightweight tomography framework (NeuTomography) for network monitoring tasks using only end-to-end path measurements of a subset of node pairs without requiring any additional assumptions on the network.
2) We build the algorithm Path Augmented Tomography (PAT) to improve the performance prediction accuracy using estimated performance bounds as the augmented input data.
3) Although no prior knowledge about the network topology is given, we establish a method using the proposed tomography framework to reconstruct the network topology.
4) Extensive experiments using real data confirm the high accuracy of NeuTomography in predicting path performance
metrics of unmeasured node pairs; the reconstructed network topologies also exhibit small errors. Such results are orders of magnitude improvement over baseline solutions.
The rest of the paper is organized as follows. Section II formulates the problem. Section III discusses the challenges. Section IV presents the proposed tomography framework. Real network data are employed in Section V for evaluations.

[Figure (not reproduced): an unknown network topology with nodes v1 to v6. Given path performance measurements of node pairs {v1, v2}, {v1, v4}, {v2, v6}, {v3, v4}, {v3, v5}, what are the path performance metrics of the following node pairs? {v1, v3}, {v2, v5}, {v1, v5}, {v3, v6}, {v1, v6}, {v4, v5}, {v2, v3}, {v4, v6}, {v2, v4}, {v5, v6}. Legend: end-to-end path measurement (where intermediate traversed nodes are unknown).]
Let $V$ denote the set of nodes in $G$ ($|V| = n$) and set $\{v_i, v_j\}$ ($v_i, v_j \in V$, $v_i \neq v_j$) a node pair. Then $T = \{\{v_i, v_j\}\}_{v_i, v_j \in V,\, v_i \neq v_j}$ is a set containing all node pairs in $V$ (i.e., $|T| = \binom{n}{2}$). For a given performance metric of interest (any metric as discussed in Section II-A), suppose we are given the measured end-to-end path performance metric associated with each node pair in a subset $S$ of $T$, i.e., $S \subset T$. Then we explore how to infer the unmeasured path information as accurately as possible purely based on this available set of path measurements. To make our problem more practical and applicable to real networks, we do not constrain the way

end path performance metrics for all unmeasured node pairs, i.e., in set $T \setminus S$ (recall that $T$ is the complete node pair set).
Our second objective is to reconstruct the original network topology based on the given path performance measurements of node pairs in set $S$ when the type of performance metrics allows; see Section IV-C for details.
Discussions: In practical networks, generally only a small portion of end-to-end path measurements are available. Therefore, for the first objective, we aim to establish a framework that exhibits high accuracy when $|S|$ is small, thus applicable to real network monitoring tasks. Note that in cases where
some node pairs in $T \setminus S$ are directly connected by links, i.e., neighbors in the underlying unknown network topology, their inferred path performance metrics correspond to their link performance metrics if these links are selected for routing. Regarding the second objective, its significance is that the topological information is critical to many network applications and operations, e.g., traffic engineering, fault localization, etc.

III. PROBLEM CHALLENGES AND RELATIONS TO CLASSICAL NETWORK TOMOGRAPHY PROBLEMS
Our problem formulation in Section II is a generalization of classical network tomography problems. In particular, let $\mathbf{c}$ denote the path performance metric vector of all node pairs in set $S$. Then the network tomography problem can be described by

$\mathbf{R} \otimes \mathbf{w} = \mathbf{c}$, (1)

where $\mathbf{w}$ is the performance metric vector of all links in the network with entry $w_i$ denoting the performance metric of link $l_i$, $\mathbf{R} = (R_{i,j})$ is called the routing matrix with each entry $R_{i,j} \in \{0, 1\}$ representing whether link $l_j$ is present on the path between the $i$-th node pair in $S$, and $\otimes$ is the operator indicating how link metrics are related to the end-to-end path metrics. In particular, the meaning of $\otimes$ depends on the problem considered, as described below.
1) Additive metric tomography: For this class of problems, the common assumption is that $\mathbf{R}$ and $\mathbf{c}$ are both known, and $\otimes$ is simply the matrix multiplication.
2) Boolean metric tomography: For Boolean metric tomography, all performance metrics are binary, where 0 represents “normal” and 1 represents “failed”. In this case, $\otimes$ is the Boolean matrix product, i.e., $c_i = \bigvee_j (R_{i,j} \wedge w_j)$.
3) Congestion level (or bandwidth) tomography: For congestion level tomography, operator $\otimes$ finds the most congested link in the given path, i.e., $c_i = \max_j (R_{i,j} w_j)$. For bandwidth tomography, $\otimes$ finds the link with the minimum bandwidth, i.e., $c_i = \min_j (R_{i,j} w_j)$.
Discussions: For all these classical problems, $\mathbf{R}$ is assumed to be known and the number of faulty links is generally bounded. Then, $\mathbf{w}$ and $\mathbf{R}'$, which is the routing matrix corresponding to the unmeasured node pairs in $T \setminus S$, are inferred for computing the path performance metrics for node pairs in $T \setminus S$ via $\mathbf{R}' \otimes \mathbf{w}$. In comparison, our problem setting is more

IV. GENERIC NETWORK TOMOGRAPHY FRAMEWORK
In this section, we propose a neural-network-based network tomography framework to address our objectives.

A. Neural-Network-Based Tomography Framework (NeuTomography)
Neural networks have been shown to be exceptionally powerful in the field of machine learning [43]. Moreover, the universal approximation theorem [71] proves that neural networks are capable of approximating any non-linear function as long as the hidden layer consists of a sufficient number of neurons. As such, we use neural networks to address the potential non-linearity in our network tomography problem.
The neural network is a mathematical architecture where the training variables are continuous. In contrast, both routing matrices $\mathbf{R}$ and $\mathbf{R}'$ in our problem are binary (see the discussions in Section III). Nevertheless, our ultimate goal is to determine $\mathbf{R}' \otimes \mathbf{w}$ rather than individual $\mathbf{R}'$ and $\mathbf{w}$. Therefore, we propose to relax the binary values in the routing matrices $\mathbf{R}$ and $\mathbf{R}'$ to continuous values ranging from 0 to 1, thus forming stochastic routing matrices, denoted by $\tilde{\mathbf{R}}$ and $\tilde{\mathbf{R}}'$, respectively. Then, entry $\tilde{R}_{i,j}$ in $\tilde{\mathbf{R}}$ (or $\tilde{R}'_{i,j}$ in $\tilde{\mathbf{R}}'$) indicates the probability that link $l_j$ exists on the path connecting the $i$-th node pair in $S$ (or $T \setminus S$). By such routing matrix relaxation, we propose a neural-network-based network tomography framework, called NeuTomography, as shown in Figure 2.

[Figure 2 (not reproduced): the NeuTomography model, with an n-dim input vector with two ones followed by γ-dim hidden layers.]

Figure 2 is a neural network model consisting of $k$ fully-connected hidden layers [43], where each hidden layer contains $\gamma$ neurons. Here, $\gamma$ is the estimated number of links in the network. Note that the exact number of links is unknown, and we discuss later how the value of $\gamma$ is estimated and how $\gamma$ affects the inference accuracy. For this model, each input is a node pair from set $S$. At the input layer, the node pair, say $v_1$ and $v_2$, is mapped to an $n$-dimensional “one-hot” vector $\mathbf{v}_0$ (recall that $n$ is the total number of nodes), where only the positions corresponding to $v_1$ and $v_2$ are 1, and all others are 0. Next, as in typical fully connected neural networks, $\mathbf{v}_0^T$ is multiplied by an $n \times \gamma$ matrix $\mathbf{M}_1$, and then a bias vector $\mathbf{b}_1$ is added. The resulting $\mathbf{v}_0^T \mathbf{M}_1 + \mathbf{b}_1^T$ is passed to hidden layer 1 and taken as input by an activation function [43] $\sigma(\cdot)$, i.e.,
hidden layer 1 outputs $\sigma(\mathbf{v}_0^T \mathbf{M}_1 + \mathbf{b}_1^T)$. Each of the subsequent hidden layers has the same activation function $\sigma(\cdot)$ and operates in the same way. Thus, let the output vector from hidden layer $j - 1$ be $\mathbf{v}_{j-1}$; then the output from hidden layer $j$ ($j \leq k$) is $\mathbf{v}_j^T = \sigma(\mathbf{v}_{j-1}^T \mathbf{M}_j + \mathbf{b}_j^T)$, where $\mathbf{M}_j$ is a $\gamma \times \gamma$ weight matrix between hidden layers $j - 1$ and $j$, and $\mathbf{b}_j$ is the corresponding bias vector. Finally, $\mathbf{v}_k^T$ generated by hidden layer $k$ is multiplied by a $\gamma \times 1$ weight vector $\mathbf{m}$, i.e., $\mathbf{v}_k^T \mathbf{m}$, as the final path performance metric between the input node pair $v_1$ and $v_2$ at the output layer (only one neuron and no activation function or bias in the final output layer).
Design Intuitions of NeuTomography: In NeuTomography, we select sigmoid [43] as the activation function $\sigma(\cdot)$ across all hidden layers, i.e., to represent the probability of each link appearing on paths. Its design intuition is as follows:
1) When performance metrics are additive, for the input node pair, say the $i$-th node pair in set $S$, intuitively, the purpose of all hidden layers is to compute the $i$-th row $\tilde{\mathbf{R}}_{i,:}$ of the stochastic routing matrix $\tilde{\mathbf{R}}$ (each entry value is between 0 and 1). Then the weight vector $\mathbf{m}$ connecting to the output layer represents the approximated metrics for all links in the network. Moreover, the mapping from node pairs to the routing matrix is highly complicated and likely to be non-linear. Therefore, we use multiple ($k$) hidden layers to capture such relations, where each additional layer tries to refine the probability of each link appearing on a particular path.
2) When performance metrics are non-additive, e.g., congestion level and bandwidth, the design intuition is that for the $i$-th input node pair, the goal of the $k$-th hidden layer is to output a “one-hot” vector $\mathbf{v}_k$ with 1 in only one position and 0 elsewhere. In this “one-hot” vector, the position with value 1 corresponds to the most problematic link for the congestion level and bandwidth tomography. On the other hand, since our objective is to accurately predict the product of $\mathbf{v}_k^T$ and $\mathbf{m}$, and $\mathbf{v}_k$ and $\mathbf{m}$ are not unique for the same product, we only need to train the neural network model for such product instead of individual $\mathbf{v}_k$ and $\mathbf{m}$, thus easing the training process.

Discussions: Our goal is to predict $\mathbf{R}' \otimes \mathbf{w}$ by only using $\mathbf{R} \otimes \mathbf{w}$, where $\mathbf{R}$ and $\mathbf{R}'$ are related by the underlying routing protocol(s), and no prior knowledge about $\mathbf{R}$, $\mathbf{R}'$, or $\mathbf{w}$ is available. The gist of NeuTomography is to capture such relations among $\mathbf{R}$, $\mathbf{R}'$, and $\mathbf{w}$ by the estimated number of links $\gamma$ and multiple hidden layers. In Section V, by extensive experiments, we show that by using only a small portion of measurements as the input data, NeuTomography is capable of learning the accurate relations between $\mathbf{R}$ and $\mathbf{R}'$ irrespective of the type of performance metrics. Moreover, we show that NeuTomography is robust against the estimation error of the number of links ($\gamma$).

B. Path Measurement Augmentation
In Section IV-A, NeuTomography purely utilizes the given measurements of node pairs in $S$ to predict the path performance of node pairs in $T \setminus S$. However, the measured performance metric distribution might be different from the actual performance metric distribution as $S$ can be any subset of $T$; moreover, for the training data, only a small percentage of node pairs are measured (e.g., for the experiments in Section V, $|S|/|T| \leq 30\%$), thus potentially causing model overfitting [43]. For instance, if the input data only include node pairs that are less than 3 hops apart, then the predicted distances for unmeasured node pairs are also at most 3 hops, even though the network diameter (which is not directly given in the input data) might be substantially larger than 3. As such, we propose one algorithm that leverages $S$ to construct additional input data to improve the prediction accuracy.
1) Motivation and Algorithm Sketch: For each node pair $\phi \in S$, let $d_\phi$ denote the measured path performance metric w.r.t. $\phi$, and $V$ the set of nodes appearing in the node pairs of set $S$. Then, the measurement data can be directly mapped to a weighted graph $G' = (V, S, \{d_\phi\}_{\phi \in S})$, where $V$ is the set of vertices, $S$ specifies the end-points of all edges in $G'$, and $\{d_\phi\}_{\phi \in S}$ (i.e., path performance metrics w.r.t. $S$ in the input data) are the corresponding edge weights in $G'$. In $G'$, for each node pair $\mu$ in $T \setminus S$, there exists a path $P_\mu$ connecting the nodes in $\mu$ if $G'$ is a connected graph. If $G'$ is disconnected and there exist node pairs which are not connected by any paths in $G'$, then these node pairs are not selected for augmented input data. Our idea is to use the performance metric of $P_\mu$, denoted by $\tilde{d}_\mu$, as the estimation of the real path metric of the unmeasured node pair $\mu$, and feed $(\mu, \tilde{d}_\mu)$ to NeuTomography as additional data. Then $\tilde{d}_\mu$ is updated iteratively using its initial estimation and the predicted value by NeuTomography. Lastly, the refined $\tilde{d}_\mu$ is returned as the final inferred path performance metric for $\mu \in T \setminus S$. Based on this idea, we propose a tomography algorithm with augmented data, called Path Augmented Tomography (PAT).
2) Path Augmented Tomography (PAT): The complete PAT algorithm is presented in Algorithm 1. In PAT, we first compute the path performance estimation $\tilde{d}_\mu$ for each node pair $\mu$ in $T \setminus S$ by lines 1–3. Path performance estimations are carried out such that $\tilde{d}_\mu$ corresponds to the path with the best performance metric on $G'$ w.r.t. the given tomography task. Next, with this path performance estimation, we iteratively train the neural network framework. Specifically, from the $|T \setminus S|$ unmeasured node pairs, $\alpha |T \setminus S|$ random node pairs with the estimated path performance values are selected (by line 5) and combined with the given measurement data (line 6) as the augmented training data to train NeuTomography (line 7). Note that $\beta$ in line 9 equals zero for node pairs that are not within the same component when $G'$ is disconnected. After this training process, the path performance estimation $\{\tilde{d}_\mu\}_{\mu \in T \setminus S}$ is updated by lines 8–10. In particular, parameter $\beta$ ($0 \leq \beta < 1$) is employed in line 9 to balance the estimated and the predicted value so as to avoid overfitting. Such training and value updating process is repeated until the maximum number of iterations is reached. Finally, Algorithm 1 outputs $\{\tilde{d}_\mu\}_{\mu \in T \setminus S}$.
Discussions of Algorithm 1: There are two key operations in Algorithm 1, i.e., the first “foreach” loop which initializes path performance metric estimations for unmeasured node pairs based on $G'$; and the second “while” loop which iteratively updates these estimations using predictions made by NeuTomog-
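The PAT loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: it assumes an additive metric (so the "best" path on the measurement graph is the shortest path), it assumes the update in line 9 is the convex combination `beta * estimate + (1 - beta) * prediction` (the text only says that β balances the two values), and `fit_predict` is a hypothetical stand-in for training NeuTomography and predicting the unmeasured pairs. The function names and the toy data are illustrative.

```python
import heapq
import random
from itertools import combinations


def shortest_estimates(nodes, measured):
    """Initial estimates: shortest-path length between each unmeasured pair on the
    graph whose weighted edges are the measured pairs (additive metric assumed)."""
    adj = {v: [] for v in nodes}
    for (u, v), d in measured.items():
        adj[u].append((v, d))
        adj[v].append((u, d))
    est = {}
    for src in nodes:
        dist = {src: 0.0}
        heap = [(0.0, src)]
        while heap:  # Dijkstra from src
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue
            for v, w in adj[u]:
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        for dst, d in dist.items():
            pair = tuple(sorted((src, dst)))
            if src != dst and pair not in measured:
                est[pair] = d  # only pairs connected in the measurement graph
    return est


def pat(nodes, measured, fit_predict, alpha=0.15, beta=0.6, iterations=6, seed=0):
    """Iteratively blend path-based estimates with model predictions."""
    rng = random.Random(seed)
    unmeasured = [p for p in combinations(sorted(nodes), 2) if p not in measured]
    est = shortest_estimates(nodes, measured)
    candidates = sorted(est)              # pairs eligible for augmentation
    current = dict(est)
    for _ in range(iterations):
        k = min(int(alpha * len(unmeasured)), len(candidates))
        picked = rng.sample(candidates, k)
        train = dict(measured)            # given measurements ...
        train.update({p: current[p] for p in picked})  # ... plus augmented pairs
        pred = fit_predict(train, unmeasured)
        for p in unmeasured:
            b = beta if p in est else 0.0  # beta = 0 if no path exists in the graph
            current[p] = b * current.get(p, 0.0) + (1.0 - b) * pred[p]
    return current


def mean_predictor(train, pairs):
    """Hypothetical stand-in for NeuTomography: predicts the mean training label."""
    avg = sum(train.values()) / len(train)
    return {p: avg for p in pairs}


# Toy example: a chain a-b-c-d of measured pairs (keys are sorted tuples).
nodes = ["a", "b", "c", "d"]
measured = {("a", "b"): 1.0, ("b", "c"): 2.0, ("c", "d"): 3.0}
initial = shortest_estimates(nodes, measured)
result = pat(nodes, measured, mean_predictor)
```

For non-additive metrics the initial estimate would instead use the corresponding path operator (for example, the maximum link value along the path for congestion level); only `shortest_estimates` would need to change in this sketch.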
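The forward pass of the NeuTomography model described in Section IV-A can be sketched in plain Python as follows. This is a minimal sketch with random, untrained weights: the initialization range, the integer node indexing, and the helper names (`init_params`, `forward`) are assumptions made for illustration, and training (mean square error loss with the Adam optimizer, per the experiment settings) is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(n, gamma, k, rng):
    """Random weights: M_1 is n x gamma, M_2..M_k are gamma x gamma, m has gamma entries."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    Ms = [mat(n, gamma)] + [mat(gamma, gamma) for _ in range(k - 1)]
    bs = [[0.0] * gamma for _ in range(k)]
    m = [rng.uniform(-0.1, 0.1) for _ in range(gamma)]
    return Ms, bs, m


def forward(pair, n, Ms, bs, m):
    """Predicted path metric for a node pair (i, j), with nodes indexed 0..n-1."""
    v = [0.0] * n
    v[pair[0]] = v[pair[1]] = 1.0            # input vector with two ones
    for M, b in zip(Ms, bs):                 # v_j^T = sigmoid(v_{j-1}^T M_j + b_j^T)
        v = [sigmoid(sum(v[r] * M[r][c] for r in range(len(v))) + b[c])
             for c in range(len(b))]
    return sum(x * w for x, w in zip(v, m))  # v_k^T m: no activation, no bias


rng = random.Random(0)
n, k = 6, 2                                  # 6 nodes, 2 hidden layers
gamma = int(2.5 * n)                         # estimated number of links, gamma = 2.5n
Ms, bs, m = init_params(n, gamma, k, rng)
y = forward((0, 1), n, Ms, bs, m)
```

Because the input vector only marks which two nodes form the pair, the model's output is the same for (i, j) and (j, i), consistent with node pairs being unordered.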
etfuel [44] and ITDK [45] projects, which represent IP-level connections between backbone/gateway routers of several ASes from major Internet Service Providers (ISPs) around the globe. The parameters of selected networks obtained from these two projects are listed in Table I, where AS15706 is from ITDK and others are from Rocketfuel, and the last five columns are unknown to NeuTomography (as discussed in Section II). Note that “D” and “ASPL” in Table I stand for network diameter and average shortest path length, respectively, both in terms of the number of hops. Since the Rocketfuel and ITDK projects do not directly provide path measurements, we consider the following three aspects to generate path measurements using the available network data for evaluating NeuTomography.

Remark: The purpose here is only to provide a method to generate measurement paths to evaluate NeuTomography. Besides the end-to-end path metrics of selected node pairs, NeuTomography does not know anything about link metrics, network topologies, routing strategies, or sampling methods that are used to generate data as discussed below.

1) Link metrics. Unlike Rocketfuel, ITDK in [45] does not provide the link metric information; therefore, for experimental purposes, the link metric distribution of AS15706 is approximated by that of AS1221 in Rocketfuel (which has a similar number of nodes). Furthermore, besides the link metric information in Table I, to extensively study NeuTomography, we also consider two other types of link metrics: (i) unweighted link metrics, where there is no link metric, and (ii) uniform link metrics, where link metrics in the network are uniformly distributed between 1 and 10.

2) Routing strategies. To construct a path between two nodes, two routing strategies are employed: (i) Min-Hop Routing (MHR), where a path incurring the minimum number of hops is selected, and (ii) Best Performance Routing (BPR), where w.r.t. a given performance metric of interest, the path with the best performance metric is selected, e.g., shortest path for the metric of delay, least congested path for the metric of congestion level.

3) Sampling methods. When the above 1) and 2) are known, the end-to-end path performance metrics can be obtained for all node pairs (in set $T$). We then sample a subset $S$ of $T$ to form the input data. We first consider random sampling, where $S$ is randomly picked from $T$. Since there may exist constraints on measurable pairs in real networks, we next consider an alternative method, called monitor-based sampling. In monitor-based sampling, we first randomly select $\rho$ nodes as monitors; then each monitor pings all other nodes (both monitors and non-monitors) to measure the end-to-end path performance between them. Thus, under monitor-based sampling, each node pair in $S$ contains at least one monitor. For each of these sampling methods, the sampling ratio $|S|/|T|$ is selected from {20%, 25%, 30%} (the number of monitors $\rho$ under monitor-based sampling is tuned such that the required $|S|/|T|$ is reached). All path metrics associated with node pairs in $T \setminus S$ serve as the testing data.

B. Benchmark Solutions
To study the performance of NeuTomography, it is compared against the following benchmarks.
1) Minimum Monitor Placement and Determination of All Identifiable Links (MMP+DAIL). MMP+DAIL [12], [25] is a state-of-the-art tomographic solution for additive performance metrics, under the assumption of known network topology and controllable cycle-free routing. In particular, MMP places the minimum number of measuring vantages to ensure all measurement paths are sufficient to accurately compute all link metrics. DAIL, in turn, determines all links whose performance metrics are accurately inferable under the given measuring vantages. To employ MMP+DAIL as a benchmark, we test it under erroneous topological information as follows. Let $G = (V, E)$ be the actual network topology ($V$/$E$ set of nodes/links in $G$) and $G' = (V', E')$ the perceived network topology by MMP with the topological information error being $\epsilon$ ($0 < \epsilon < 1$), where $V = V'$ and $E \neq E'$. Specifically, for link $e \in E$, $e \in E'$ with probability $1 - \epsilon$; for link $e \notin E$, $e \in E'$ with probability $\epsilon$. Based on the measuring vantages placed by MMP in $G'$, DAIL determines all link metrics. However, links in $E \setminus E'$ are not visible to DAIL and links in $E' \setminus E$ do not exist. Therefore, DAIL only utilizes links in $E \cap E'$ to construct measurement paths for determining link metrics in $E'$. Edges in $E'$ whose performance metrics cannot be uniquely determined are assigned arbitrary values according to the distribution of the inferable link metrics. With these link metrics, if the underlying routing mechanism is given, then the path performance metric for any node pair is computable.
2) Arbitrary-valued Non-additive Metric Identification (ANMI). For non-additive performance metrics, e.g., congestion, most existing tomography approaches [28], [29], [30], [31], [32], [33], [34], [35] aim to localize the problematic links under the assumption of known network topology. Currently, the state-of-the-art approaches are capable of either uniquely locating up to k binary link metrics (normal/failed) [31] or locating only one problematic link and determining its arbitrary-valued link metric [61]. To the best of our knowledge, there is no tomographic approach that is capable of handling arbitrary-valued non-additive performance metrics without the constraint of the number of problematic links. In this regard, we employ an artificial method called Arbitrary-valued Non-additive Metric Identification (ANMI) that is similar to range tomography, but without the constraint of the number of problematic links. Specifically, given a tunable threshold parameter $\tau$, when a link performance metric is less than $\tau$, this link is regarded as normal, and problematic otherwise. Suppose there exists a method that can uniquely localize all problematic links in the network when the network topology is known. Then, assuming we are also given the precise performance metric distributions of normal and problematic links in the network, we further estimate the fine-grained link metrics by generating the estimated values according to these given metric distributions. Finally, similar to MMP+DAIL, the path performance metric for any node pair is computable if the underlying routing mechanism is given.
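The ANMI benchmark can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: the "given" metric distributions of normal and problematic links are represented by the empirical values of each class, the true link metrics stand in for the supposed method that localizes all problematic links, and the link names, metric values, and threshold are hypothetical.

```python
import random


def anmi_link_estimates(link_metrics, tau, rng):
    """Classify each link by the threshold tau, then draw its estimated metric
    from the empirical distribution of its own class (normal or problematic)."""
    normal = [w for w in link_metrics.values() if w < tau]
    problematic = [w for w in link_metrics.values() if w >= tau]
    estimates = {}
    for link, w in link_metrics.items():
        pool = normal if w < tau else problematic
        estimates[link] = rng.choice(pool)
    return estimates


def path_congestion(path_links, estimates):
    """Congestion level of a path: the most congested link on it."""
    return max(estimates[link] for link in path_links)


# Hypothetical link congestion levels and threshold.
link_metrics = {"l1": 0.1, "l2": 0.3, "l3": 0.8, "l4": 0.9}
tau = 0.5
estimates = anmi_link_estimates(link_metrics, tau, random.Random(0))
level = path_congestion(["l1", "l3"], estimates)
```

Each estimated value stays within the class of its link, so a path containing a problematic link is always assigned a problematic congestion level, although the exact value may differ from the true one.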
In addition to MMP+DAIL and ANMI, which are proposed specifically for network tomography, we also compare NeuTomography against two solutions established in other related areas, which are described in the following.
3) Non-negative Matrix Factorization (NMF) [78]. NMF is widely used in recommendation systems, where the goal is to complete the non-negative user-item rating matrix via the product of two lower dimensional matrices. At a high level, recommendation systems and network tomography share similar objectives, as they both aim to predict the unknown non-negative entries in a matrix using some given entry values. As such, we use NMF as one benchmark solution.
4) Neural Matrix Factorization (NeuMF) [79]. NeuMF is a neural-network-based solution employing both neural collaborative filtering [80] and matrix factorization for recommendation systems. To use NeuMF as a benchmark, we adapt our measurement data to the input format of NeuMF, which requires the user-item rating to be between 0 (dislike) and 1 (like). For performance metrics where large values are not preferable, e.g., delay and the number of hops, we use the reciprocal of the measured path metrics as the output of NeuMF; for performance metrics where large values are preferable, e.g., bandwidth and delivery ratio, we normalize the path metrics as the NeuMF output. In this way, path metrics ($\geq 1$) with superior (or poor) performance are mapped to values close to 1 (or 0).

Remark: There are no existing tomographic solutions that operate under our highly relaxed problem settings. In this regard, we choose MMP+DAIL and ANMI, which have been adapted to our simulation settings, as representative solutions for traditional tomography problems under strong assumptions. Moreover, NeuTomography is compared to the NMF and NeuMF algorithms, which demonstrates that although the latter two work well in tasks sharing similar objectives as our tomography tasks, NeuTomography outperforms them due to its customized designs.

C. Experiment Settings
1) Input Data for Training: We evaluate NeuTomography on three path performance metrics: (i) the number of hops, (ii) accumulated path delays, and (iii) the path congestion level. Note that the path bandwidth and the binary normal/failed metric are similar to the path congestion level as all of them are determined by the worst-performing link; therefore, we use the path congestion level as a representative non-additive metric in this section. For each path metric, we generate the input training data via the combination of link metric types, routing strategies, sampling methods, and sampling ratios as discussed in Section V-A.
2) Framework Parameters: For NeuTomography, we select the following parameters. As discussed in Section II-B, the given measurement data already covers all nodes in the network; therefore, the dimension of the input layer, $n$, in Figure 2 is determined. We estimate the number of links $\gamma$ (i.e., the number of neurons in hidden layers) from the average node degree (defined as $2\gamma/n$) in real networks. As shown in [81], in real communication networks, the average node degree is generally between 1 and 5. In this regard, we set $\gamma = 2.5n$. Such $\gamma$ is an overestimation for AS3967, AS3257, and AS1221, but an underestimation for AS15706 (see Table I). In Section V-D, we study how such estimation inaccuracy affects the performance. For the number of hidden layers $k$, to balance the accuracy and the training time, we set $k = 2$. Furthermore, we select the mean square error (MSE) as the loss function, Adam [82] (a stochastic gradient descent based method) as the optimizer, and 1000 epochs for training. In addition, when the enhanced algorithm PAT is employed, we set $\alpha = 15\%$, $\beta = 0.6$, and #iterations = 6 (line 4 in Algorithm 1).

D. Path Metric Prediction Accuracy
To show the advantages of NeuTomography, we first illustrate the distribution of the predicted performance metrics. For instance, in Figure 4, the predicted metric distribution by NeuTomography almost overlaps with the actual distribution for different performance metric types and sampling methods (PAT is used for monitor-based sampling), while the results by NMF and NeuMF deviate from the actual distribution significantly. Note that since MMP+DAIL requires the network topology as input, it cannot be compared with other solutions under the same settings, and is thus omitted. In Figure 4, the path performance metrics in the input training data have different distributions under different sampling methods; nevertheless, NeuTomography can recover the actual performance metric distribution for all unmeasured node pairs.
In addition to the predicted performance metric distribution, more importantly, we need to evaluate the path metric prediction accuracy for each unmeasured node pair. As such, in this section, we focus on the mean absolute percentage error (MAPE) as the accuracy evaluation metric; the corresponding results under different experiment settings, i.e., (i) link metrics that are unweighted (UN), from real data (Re), or uniformly distributed (UD), (ii) best performance routing (BPR) or min-hop routing (MHR), and (iii) random or monitor-based sampling, are reported in Tables II–IX. In addition, we also repeat each experiment under $\gamma = 2n$ and $\gamma = 3n$, and get similar results, which are omitted due to page limitations. These results therefore confirm the robustness of NeuTomography against link number estimation errors.
1) Additive Metrics: The results for additive performance metrics are shown in Table II, where MAPEs less than 15% are highlighted. For each AS, 20%–30% of the node pairs are measured to infer the path performance of the remaining unmeasured 70%–80% of the node pairs. In Table II, as expected, with the increased portion of measured node pairs (from 20% to 30%), the prediction accuracy is improved for all cases. Moreover, under random sampling, we observe that NeuTomography is exceptionally accurate, irrespective of networks, link weight distributions, or the underlying routing strategies. When 30% random node pairs are measured, the corresponding MAPE ranges from 2% to 7% for BPR, and from 5% to 15% for MHR (with 22% MAPE for AS3967 as the only exception). Furthermore, when the underlying routing strategy coincides with the performance metric of interest (i.e., BPR), such high prediction accuracy
[Figure 4 panels (vertical axis: pdf): (a) random sampling (additive); (b) monitor-based sampling (additive); (c) random sampling (non-additive); (d) monitor-based sampling (non-additive)]
Figure 4. Distribution of the predicted (additive/non-additive) path performance metrics (AS3257 under |S|/|T| = 30%, link weights from real data, and best performance routing).
Table II
PATH PERFORMANCE PREDICTION ERROR (MAPE IN %) OF NEUTOMOGRAPHY FOR ADDITIVE METRICS (UN/RE/UD: LINK METRICS THAT ARE UNWEIGHTED / FROM REAL DATA / UNIFORMLY DISTRIBUTED, BPR: BEST PERFORMANCE ROUTING, MHR: MIN-HOP ROUTING, MONITOR: MONITOR-BASED SAMPLING, M+PAT: MONITOR-BASED SAMPLING INPUT DATA AND PAT IS APPLIED)
Table III
PATH PERFORMANCE PREDICTION ERROR (MAPE IN %) OF MMP+DAIL FOR ADDITIVE METRICS
Table IV
PATH PERFORMANCE PREDICTION ERROR (MAPE IN %) OF NMF FOR ADDITIVE METRICS
is achievable even when only 20% node pair measurements are available. By comparison, under monitor-based sampling, the resulting MAPE is relatively large, especially in the case of |S|/|T| = 20%. Nevertheless, when |S|/|T| is increased to 30%, MAPE is reduced by half for many cases. On the other hand, even without increasing the amount of training data, using algorithm PAT alone improves the prediction accuracy significantly. As shown in Table II, MAPEs are almost halved (some even fall below 15%) after applying PAT, which demonstrates the high efficiency of PAT in reducing the prediction error.

Table II also reveals some insights into the performance of NeuTomography. First, Table I shows that link weights in AS3967 and AS3257 have higher variance compared to those in AS1221 and AS15706, which potentially causes difficulty in predicting link performance metrics. Nevertheless, when the underlying routing mechanism is BPR, NeuTomography is robust against link weight variances and achieves high prediction accuracies for all networks. Intuitively, this is because under BPR, the routing mechanism considers both the network topology and link weight information; in other words, node pair measurements incorporate more network information, including the link weight variance. Such rich information therefore enables the high prediction accuracy
Table V
PATH PERFORMANCE PREDICTION ERROR (MAPE IN %) OF NEUMF FOR ADDITIVE METRICS
Table VI
PATH PERFORMANCE PREDICTION ERROR (MAPE IN %) OF NEUTOMOGRAPHY FOR NON-ADDITIVE METRICS
Table VII
PATH PERFORMANCE PREDICTION ERROR (MAPE IN %) OF ANMI FOR NON-ADDITIVE METRICS
Table VIII
PATH PERFORMANCE PREDICTION ERROR (MAPE IN %) OF NMF FOR NON-ADDITIVE METRICS
for NeuTomography. Second, compared to BPR, the average MAPE under MHR is substantially larger. This is because MHR only captures the network topology information while the link weight information is lost. Nevertheless, under random sampling and UD, MAPE is generally less than 15%, even if the underlying routing mechanism is MHR. Furthermore, Table II implies that the effect of link weight variance becomes prominent under MHR. Specifically, MAPE is improved (even to less than 3%) in AS1221 and AS15706, which have smaller link weight variance. This observation suggests that NeuTomography is capable of dealing with different types of routing mechanisms so long as the link weight variance is relatively small. Finally, recall that links in AS1221 and AS15706 have the same weight distribution; however, we observe that the average MAPE in AS15706 is smaller, which can be explained by the network structural properties: The average node degree is 4.7 and 5.4 for AS1221 and AS15706, respectively. Thus, AS15706 is relatively densely connected, which provides more next-hop options when constructing end-to-end paths, thus leading to smaller MAPE.

We now compare our solution to the benchmarks (in Tables III–V). For MMP+DAIL, we know from [12], [25] that it does not incur any error if all assumptions are completely satisfied. However, as shown in Table III, it is extremely vulnerable to topology errors. Specifically, when the topology error is only 0.5%, the MAPE can be up to 37%; when the topology error is increased to 2%, MAPE deteriorates to around 50%. This result shows the advantages of NeuTomography, for which no additional network information is required while still achieving superior performance. In addition, under the same experiment setting, our solution significantly outperforms NMF and NeuMF with up to one order of magnitude reduction in MAPE, which demonstrates the superiority of NeuTomography. We note that NeuMF performs
Table IX
PATH PERFORMANCE PREDICTION ERROR (MAPE IN %) OF NEUMF FOR NON-ADDITIVE METRICS
relatively well for AS15706 when link metrics are from real data. This is because, unlike NMF, which only leverages linear transformations, NeuMF is a neural-network-based model that is capable of capturing non-linear relationships. However, even with such improvement, NeuMF is still inferior to our proposed solution.

Discussions on PAT: In Table II, only the results of PAT for monitor-based sampling are presented. We also test PAT under random sampling and get similar results (thus omitted due to page limitations). This is because PAT is proposed mainly for addressing the overfitting problem during training. Random sampling is unbiased in the sense that path performance metrics in the input data closely represent the performance metric distribution in the testing data. However, monitor-based sampling is biased, which causes overfitting, and PAT is able to alleviate the effect of biased sampling.

2) Non-additive Metrics: For non-additive performance metrics (Table VI), with the increased portion of measured node pairs (from 20% to 30%), the prediction accuracy is also improved. Furthermore, under BPR, NeuTomography achieves high accuracy (MAPE < 3.5%) for all cases. This is because, regarding the performance metric of interest (congestion level), the goal for BPR is to find the least congested path. Specifically, for a node pair, if there exists a path bypassing all highly congested links, then its performance metric is small. For the tested AS topologies, most end-to-end path performance metrics fall into the narrow region of 3–9, which therefore simplifies the performance inference task. Nevertheless, our objective is not only to identify the coarse-grained congestion range, but also to determine the fine-grained congestion level. In this regard, NeuTomography achieves high performance inference accuracy for each node pair without any other network information as prior knowledge. For non-additive metrics, since using only 20% node pairs already enables superior performance (MAPE < 3.5%) under BPR, PAT is not employed for the performance improvement.

By contrast, under MHR, MAPE experiences various error levels. When link congestion levels are uniformly distributed, the corresponding MAPE is between 1.6% and 18% for all networks. However, under the real link congestion level distribution, MAPE is large for AS3967 and AS3257, while still small for AS1221 and AS15706 (1%–15%). As discussed before, this is caused by the link variance and limited information in the path measurements. In Table I, the variance of the link congestion level is large for AS3967 and AS3257 (78 and 41.3, respectively). Moreover, under MHR, the routing path is independent of the link congestion level. Hence, when the variance of the link congestion level is large, there is a lack of link information embedded in the path measurements. Nevertheless, when the link congestion variance is relatively small, i.e., in AS1221 and AS15706, NeuTomography exhibits high accuracy. Furthermore, the large node degree in AS1221 and AS15706 also shortens the average path length under MHR, which reduces the likelihood of constructing a path with a bottleneck link, thus improving the prediction accuracy.

Besides, we also observe that monitor-based sampling generally incurs larger MAPE under MHR, especially when the link congestion level is from the real data and the networks are AS3967 and AS3257. Again, this observation shows that when the variance of the link congestion level is small, NeuTomography is able to recover the network information that is critical to the performance prediction even when the underlying routing mechanism is independent of the performance metric of interest. Furthermore, under MHR, we test PAT for monitor-based sampling. The results in Table VI show that when the performance metric is non-additive, PAT only slightly reduces MAPE or achieves performance similar to that without PAT. This implies that under MHR and non-additive performance metrics, it is difficult to know the performance bound of an unmeasured node pair, especially when the link congestion level exhibits high variance.

Regarding benchmark solutions, evaluation results of ANMI are shown in Table VII. The threshold ratio here is the ratio of normalized τ (i.e., τ minus the minimum link metric value) to the range of the metric value distribution (i.e., the difference between the maximum and the minimum link metric values). From the results in Table VII, we can see that although ANMI performs relatively well for AS1221 under BPR with real link metrics, NeuTomography still offers an order-of-magnitude performance improvement for the same setting. Moreover, NeuTomography outperforms NMF by up to two orders of magnitude (Table VIII). For NeuMF, although its MAPE is less than 14% under BPR (see Table IX), NeuTomography still shows one order of magnitude of improvement on average.

In sum, the results on both additive and non-additive performance metrics confirm the high efficiency and applicability of NeuTomography in real networks without relying on the knowledge of additional network information (e.g., network topology) or rigorous assumptions (e.g., controllable routing), thus providing a lightweight and robust solution.
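As an illustration of the two quantities used throughout this comparison, the following minimal Python sketch computes MAPE over a set of unmeasured node pairs and the ANMI threshold ratio as described above. It assumes the standard MAPE definition (mean of |predicted − actual| / |actual|, in percent); the function names and the sample values are hypothetical and are not taken from the experiments.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error (in %), assuming the standard definition."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")
    if np.any(actual == 0):
        raise ValueError("MAPE is undefined when an actual metric is zero")
    return 100.0 * np.mean(np.abs(predicted - actual) / np.abs(actual))

def threshold_ratio(tau, link_metrics):
    """Normalized tau (tau minus the minimum link metric) over the metric range."""
    link_metrics = np.asarray(link_metrics, dtype=float)
    lo, hi = link_metrics.min(), link_metrics.max()
    if hi == lo:
        raise ValueError("metric range is zero; the ratio is undefined")
    return (tau - lo) / (hi - lo)

# Hypothetical path delays for three unmeasured node pairs.
true_delay = [10.0, 20.0, 40.0]
pred_delay = [11.0, 18.0, 40.0]
print(f"MAPE = {mape(true_delay, pred_delay):.2f}%")  # errors of 10%, 10%, 0%

# Hypothetical link metrics and threshold tau.
print(threshold_ratio(5.0, [2.0, 4.0, 8.0]))  # (5 - 2) / (8 - 2)
```

Because MAPE normalizes each error by the true metric value, it is only meaningful for strictly positive metrics, which holds for path metrics such as hop counts (≥ 1).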
Table X
TOPOLOGY RECONSTRUCTION ERROR (FPR AND FNR IN %) W.R.T. EXTENDED ADJACENCY MATRIX $A^{(m)}$
(FPR/FNR: False Positive/Negative Rate in %. Each entry is given as FPR/FNR. m+PAT: monitor-based sampling with PAT applied.)

AS3967
|S|/|T|  sampling method  m=1        m=2        m=3        m=4        m=5
20%      random           0.4/52.9   2.7/24.3   3.9/18.7   3.5/14.2   2.4/10.0
20%      monitor          2.1/73.0   4.6/68.1   7.4/62.7   9.3/57.2   14.2/53.7
20%      m+PAT            0.0/72.0   0.4/65.9   3.0/54.8   8.6/43.0   12.4/33.7
25%      random           0.5/38.9   2.1/21.7   3.1/16.1   3.0/11.3   2.5/9.3
25%      monitor          1.6/67.5   4.2/58.7   7.8/52.1   10.9/50.9  12.3/50.8
25%      m+PAT            0.0/72.9   0.5/59.8   3.1/46.0   7.7/32.5   9.9/23.6
30%      random           0.3/43.7   1.5/16.4   1.8/8.6    1.2/4.8    0.8/3.2
30%      monitor          0.6/61.5   2.7/56.1   6.0/48.5   8.9/43.8   10.5/39.1
30%      m+PAT            0.0/60.2   0.6/53.3   2.8/42.7   7.4/32.4   9.4/21.9

AS3257
|S|/|T|  sampling method  m=1        m=2        m=3        m=4        m=5
20%      random           0.2/59.4   1.5/26.2   2.8/20.2   2.6/12.5   1.7/6.8
20%      monitor          0.5/77.1   1.8/69.0   4.3/63.1   6.6/57.4   10.2/52.5
20%      m+PAT            0.0/76.5   0.1/67.0   1.9/62.1   6.6/55.2   10.2/49.9
25%      random           0.2/50.9   1.3/20.5   2.1/14.6   1.8/9.2    1.1/4.9
25%      monitor          0.2/71.4   1.5/60.1   4.0/53.7   6.6/49.2   9.5/42.2
25%      m+PAT            0.0/71.5   0.2/59.1   2.3/49.8   5.9/40.0   8.2/32.8
30%      random           0.2/43.0   1.1/16.8   1.7/11.6   1.3/6.7    0.8/3.9
30%      monitor          0.1/52.6   1.0/27.2   2.5/18.2   2.7/12.2   2.0/7.8
30%      m+PAT            0.0/52.9   0.2/27.1   2.2/17.1   4.0/11.2   2.0/7.9

AS1221
|S|/|T|  sampling method  m=1        m=2        m=3        m=4        m=5
20%      random           0.2/22.4   0.5/8.1    1.0/3.8    0.8/3.0    0.4/2.1
20%      monitor          0.9/76.0   3.2/64.9   6.4/64.6   10.6/62.3  11.6/57.4
20%      m+PAT            0.0/75.5   0.2/61.8   2.5/62.6   10.9/56.1  8.9/42.1
25%      random           0.1/21.2   0.4/5.6    0.4/3.0    0.2/1.1    0.2/0.7
25%      monitor          0.4/67.4   2.3/54.9   5.7/54.7   8.5/55.6   11.6/49.2
25%      m+PAT            0.0/65.0   0.3/49.6   2.0/43.2   7.7/32.3   8.6/26.0
30%      random           0.1/19.5   0.5/3.6    0.6/1.9    0.4/1.7    0.2/1.4
30%      monitor          0.1/58.6   1.7/43.4   5.1/38.2   5.3/44.1   7.8/37.9
30%      m+PAT            0.0/55.2   0.3/41.7   2.5/38.6   5.5/29.6   7.9/23.5

AS15706
|S|/|T|  sampling method  m=1        m=2        m=3        m=4        m=5
20%      random           0.1/7.4    0.3/1.1    0.8/0.7    0.5/1.4    0.2/5.0
20%      monitor          5.0/60.3   7.6/50.9   21.3/39.0  18.3/39.6  4.3/40.7
20%      m+PAT            1.7/55.6   6.3/43.5   13.4/30.0  15.7/33.4  4.0/33.1
25%      random           0.1/6.6    0.2/0.5    0.4/0.6    0.1/0.8    0.1/1.8
25%      monitor          2.5/47.2   5.4/24.9   13.2/20.6  8.9/20.2   1.6/18.2
25%      m+PAT            2.4/46.5   5.0/21.1   12.0/18.5  8.3/15.5   1.4/16.5
30%      random           0.0/5.2    0.1/0.3    0.2/0.2    0.2/0.3    0.0/1.3
30%      monitor          0.3/44.4   4.4/18.7   8.2/19.4   8.2/15.7   1.8/17.8
30%      m+PAT            0.3/41.2   4.0/13.2   6.1/14.7   7.8/12.3   1.6/14.6
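The FPR and FNR values in Table X follow the definitions given in Section V-E: with τ non-zero entries in the real matrix, FPR counts falsely reported non-zero entries over n² − τ, and FNR counts missed non-zero entries over τ. The sketch below is a minimal illustration of those definitions on a hypothetical 50-node ring; the function names and the toy topology are assumptions for illustration only. It also shows why the plain matrix difference is a weak metric: an all-zero reconstruction scores a small difference while missing every link.

```python
import numpy as np

def fpr_fnr(a_true, a_pred):
    """Return (FPR, FNR) in % for a real and a reconstructed binary matrix."""
    a_true = np.asarray(a_true) != 0
    a_pred = np.asarray(a_pred) != 0
    tau = int(a_true.sum())          # non-zero entries in the real matrix
    n2 = a_true.size                 # n^2 entries in total
    false_pos = int(np.sum(a_pred & ~a_true))
    false_neg = int(np.sum(~a_pred & a_true))
    return 100.0 * false_pos / (n2 - tau), 100.0 * false_neg / tau

def matrix_difference(a_true, a_pred):
    """Sum of absolute entry-wise differences divided by n^2."""
    a_true = np.asarray(a_true, dtype=float)
    a_pred = np.asarray(a_pred, dtype=float)
    return np.abs(a_true - a_pred).sum() / a_true.size

# Hypothetical topology: an undirected ring on n = 50 nodes (100 non-zero entries).
n = 50
ring = np.zeros((n, n), dtype=int)
for i in range(n):
    j = (i + 1) % n
    ring[i, j] = ring[j, i] = 1

# An all-zero reconstruction looks good under the matrix difference ...
empty = np.zeros_like(ring)
print(matrix_difference(ring, empty))   # 100 / 2500 = 0.04
# ... but FNR exposes that every link is missed.
print(fpr_fnr(ring, empty))             # FPR 0%, FNR 100%

# A reconstruction that misses link (0, 1) and invents link (0, 25).
guess = ring.copy()
guess[0, 1] = guess[1, 0] = 0
guess[0, 25] = guess[25, 0] = 1
print(fpr_fnr(ring, guess))
```

Reporting the two rates separately keeps the large number of true zeros (n² − τ) from masking missed links, which is the motivation stated for preferring FPR/FNR over the matrix difference.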
E. Topology Reconstruction Accuracy

When the performance metric of interest is the minimum number of hops, we use NeuTomography to reconstruct the network topology in terms of the extended adjacency matrix. To test the reconstruction accuracy, intuitively, we can use the matrix difference $\left(\sum_i \sum_j |A^{(m)}_{i,j} - A'^{(m)}_{i,j}|\right)/n^2$ as the evaluation metric, where $A^{(m)}$ and $A'^{(m)}$ are the real and constructed extended adjacency matrices, respectively. However, since the number of links in a network is generally much smaller than $n^2$, even an all-zero matrix $A'^{(m)}$ leads to a small matrix difference. Therefore, we use the False Positive Rate (FPR) and False Negative Rate (FNR) as the evaluation metrics. Specifically, let $A^{(m)}$ and $A'^{(m)}$ be the real and constructed extended adjacency matrices, and τ the number of non-zero elements in $A^{(m)}$. Then FPR is the number of non-zero elements in $A'^{(m)}$ that are zeros in $A^{(m)}$ over $n^2 - τ$; similarly, FNR equals the number of zero elements in $A'^{(m)}$ that are non-zeros in $A^{(m)}$ over τ. The reconstructed network topology is accurate if both FPR and FNR are small. The corresponding results are reported in Table X, where the cases with both FPR and FNR less than 15% are highlighted. First, for extended adjacency matrices, FPR is small for all cases, as usually $τ \ll n^2$, and thus the denominator $n^2 - τ$ is much larger than the numerator. Second, as expected, the increased number of measured node pairs is beneficial in improving the topology reconstruction accuracy. Third, under random sampling, $A^{(1)}$, i.e., m = 1, is mostly inaccurate (except for AS15706). Nevertheless, when m is increased to 2, FNR is reduced by over a half. Specifically, w.r.t. $A^{(2)}$, for both AS1221 and AS15706, FPR and FNR are less than 4% when 30% random node pairs are measured. Fourth, monitor-based sampling yields high topology reconstruction error; nevertheless, for some networks, i.e., AS15706, FNR is reduced to less than 15% via PAT. Finally, since AS1221 and AS15706 have the same link congestion level distribution, Table X demonstrates that for the denser network AS15706, even $A^{(1)}$ is accurate. Moreover, under monitor-based sampling, PAT yields low reconstruction error for m ≥ 2. This result implies that the topology reconstruction is more accurate in networks with high density. This is because in dense networks, there exist more node pairs which are close to each other; therefore, the probability that these close node pairs are selected for measurement is increased, which assists the learning process in NeuTomography. The benchmarks NMF and NeuMF can also be used to construct $A^{(m)}$; however, their performance is substantially worse than that of NeuTomography, and the results are thus omitted due to page limitations. In sum, NeuTomography provides a state-of-the-art solution to reconstruct network topologies with various granularities using only a small percentage of node pair measurements without additional network knowledge.

VI. CONCLUSION

We revisited the problem of network tomography from the practical perspective. Without relying on any assumptions on network topologies, protocol support, or measurement metric properties as in the literature, we established a generic tomography framework, NeuTomography, to infer unknown network characteristics using only end-to-end path performance metrics of selected node pairs. Next, regarding the potential overfitting problem, we proposed one algorithm that utilizes active performance bound estimation as the augmented data for iteratively improving the performance prediction accuracy. Furthermore, we investigated the feasibility of employing NeuTomography to reconstruct the network topology under the given limited measurement data. Extensive experiments using real network data show that NeuTomography is robust against network parameter errors and exhibits high prediction accuracies for both additive and non-additive performance metrics, with up to orders-of-magnitude improvement over benchmark solutions. Besides, with small errors in terms of extended adjacency matrices, the reconstructed network topologies also provide vital insights to network operational optimizations.