Algorithms for Next Generation Networks
Editors
Dr. Graham Cormode
AT&T Research
Florham Park, NJ, USA
[email protected]

Dr. Marina Thottan
Bell Labs
Murray Hill, NJ, USA
[email protected]
Series Editor
Professor A.J. Sammes, BSc, MPhil, PhD, FBCS, CEng
Centre for Forensic Computing
Cranfield University
DCMT, Shrivenham
Swindon SN6 8LA
UK
ISSN 1617-7975
ISBN 978-1-84882-764-6 e-ISBN 978-1-84882-765-3
DOI 10.1007/978-1-84882-765-3
Springer London Dordrecht Heidelberg New York
“A few months ago a colleague at Georgia Tech asked me ‘Which are the top-10
most significant algorithms that have been developed by networking research in the
last decade?’ This book would provide an excellent answer. Covering a wide range
of topics from the entire spectrum of networking research, this book is highly recom-
mended for anyone that wants to become familiar with the ‘network algorithmics’
state-of-the-art of the evolving Internet.”
Foreword
Over the past 20 years, computer networks have become a critical infrastructure in
the global economy and have changed people’s lives in ways that could not have
been predicted in the early days of the Internet. Many technological advances con-
tributed to these changes, including Moore’s Law, the emergence of the World Wide
Web, and concurrent advances in photonic transmission and disk storage capacity.
These advances also relied on an economic climate that allowed innovation to flour-
ish, in Silicon Valley and in other parts of the world. This story has been told many
times. One part of the story that is not widely appreciated is the critical role that
has been played by advances in algorithms research over this time. Advances in
technology occur when engineers have both the theoretical understanding and the
practical means to address an important technical problem. In this foreword, I will
look briefly at a few of the major advances in algorithms that have fueled the growth
of computer networks over the last 20 years. As we look ahead to the next generation
of advances in networking, this book provides a survey of key areas in algorithms
research that will be important to the researchers and architects working in our field.
The first graphical web browser, Mosaic, made its debut in 1993 – an event that
is often associated with the beginning of the growth of the web. In the late 1990s,
as the use of the web grew, few consumers had access to the Internet at speeds
greater than 1 Megabit/s, and web page download times were slow. By placing web
caches in ISP networks close to the end users, content delivery networks (CDNs)
emerged to support “web acceleration”. This situation led to a flurry of research in
Internet mapping and the development of practical algorithms for DNS-based load
balancing. Although broadband speeds have increased significantly since that time,
CDNs are widely used to efficiently deliver web and multimedia content. Variants
of the measurement tools and algorithms developed in the late 1990s are still in use
by CDN providers today.
As the Internet grew, one of the challenges that might have limited the growth
of the global Internet was scalable IP routing lookups. In the late 1990s, to better
manage the utilization of the IPv4 address space the IETF published a new way
of allocating IP addresses, called Classless Interdomain Routing (CIDR), along
with a new method for routing IP packets based on longest prefix match. The
need for longest prefix match routing forced the development of efficient hardware
and software algorithms, which were developed during several years of vigorous
research and subsequently implemented in routers and other network products. Vari-
ants of these algorithms continue to be used today.
The increase in the number of connected Internet users also led to the emergence
of peer-to-peer (P2P) file sharing networks. P2P applications such as Napster and
Gnutella emerged around 2000, and were used to exchange music and other forms
of digital content. The rapid growth of distributed P2P applications inspired the
invention of Distributed Hash Tables (DHTs). DHTs provide an efficient and very
general distributed lookup service that is a fundamental building block in today’s
P2P applications such as the popular BitTorrent and in other overlay networks.
The rapid development of new Internet applications caused ISPs and businesses
to need new tools to better understand the traffic on their networks. Router vendors
developed mechanisms for exporting flow records that could be used for this pur-
pose, but the huge volume of flow data required the invention of statistically robust
methods for sampling flow data, such as “smart sampling”. More generally, the vast
amount of data transferred on the Internet has sparked a generation of research on
algorithmic approaches to handling data streams.
The increasing amount of content on the web also led to the need for tools to
search the web. While at Stanford University, Larry Page invented the original and
now famous PageRank algorithm, which assigns a numerical weight to linked docu-
ments in a database based on the frequency of links to each document. The original
PageRank algorithm has been enhanced over nearly a decade in an effort to improve
web search. The combination of web search with online advertising created a dy-
namic online marketplace. This has led to a new industry focused on search engine
optimization.
A discussion of the advances in computer networks over the past two decades
would not be complete without mentioning the emergence of wireless data
networks. The viral spread of 802.11-based WiFi networks and the more recent
deployment of 3G cellular data services have made mobile access to information
possible. Among the most important algorithmic advances, one must certainly
list the invention of Multiple-Input-Multiple-Output (MIMO) antenna systems, and
related work on Space-Time Coding. Since the first practical implementation of spa-
tial multiplexing in the late 1990s, there has been over a decade of research aimed
at improving the performance of wireless communication links that continues to
this day.
Finally, optimization problems emerge in many aspects of computer networks.
The results of decades of research in optimization are used daily in the design of
large-scale networks. While optimization may not be as visible to users of the net-
work as web search or wireless access, it is a critical tool in the network designer’s
toolkit.
It is likely that the next two decades will bring about continued advances in
computer networks that will depend as much on sound algorithmic principles as
the advances of the last two decades have. Efficient spectrum management and
techniques for limiting interference will be critical to wireless networks. Overlay
networks will be designed to operate efficiently in conjunction with the IP layer.
Techniques for packet processing and network monitoring will be developed to sup-
port the needs of large ISP networks. Network-based applications such as online
gaming and social networks will continue to evolve, alongside new applications that
have not yet been invented. Algorithms are at the heart of the networks that we use.
This book offers a broad survey of algorithms research by leading researchers in the
field. It will be a valuable tool for those seeking to understand these areas of work.
Preface

Since the early 1990s, coupled with the widespread deployment of broadband to the home, we have seen remarkable progress in the ease of Internet access for end users. Both commercial and private sectors rely heavily on the availability of
the Internet to conduct normal day-to-day functions. Underpinning this exponential
growth in popularity of the Internet are the advances made in the applications of
basic algorithms to design and architect the Internet. The most obvious example of
these algorithms is the use of search engines to collect and correlate vast amounts
of information that is spread throughout the Internet.
With the dawn of this new century, we are now on the verge of expanding the
notion of what we mean to communicate. A new generation of netizens is poised to leverage the Internet for a myriad of different applications that we have not envisioned
thus far. This will require that the Internet be flexible and adapt to accommodate
the requirements of next-generation applications. To address this challenge, in the
United States, the National Science Foundation has initiated a large research project
GENI. The goal of GENI is to perform a clean-slate design for a new Internet.
In particular, the aim of this project is to rethink the basic design assumptions on
which the current Internet is built, with the possibility that to improve flexibility for
new services we may arrive at a radically different Internet, beyond what one might
imagine from evolving the current network. Given this context of Internet research,
the purpose of this book is to provide a comprehensive survey of present algorithms
and methodologies used in the design and deployment of the Internet. We believe
that a thorough understanding of algorithms used by the Internet today is critical to
develop new algorithms that will form the basis of the future Internet.
The book is divided into three parts dealing with the application of algorithms
to different aspects of network design, operations, and next-generation applications.
Part I provides an algorithmic basis for the design of networks both at the physi-
cal and the service layer. This part is extensive since it considers different physical
layer network technologies. The first chapter in this part outlines the goals for opti-
mization in network design by considering both the optimizability of protocols and
the optimum placement of network functionality. The general idea of Valiant load
balancing, and its application in the context of efficient network design, is presented
in Chapter 2.
coding and its applications to communication networks. The algorithms that un-
derlie the ubiquitous Internet search applications such as Yahoo and Google are
surveyed in Chapter 16. However, as the nature and the modes of usage of this con-
tent on the Internet evolves the requirements of these search engine algorithms must
also evolve. This chapter provides perspectives on how this evolution is taking place.
Until today, the Internet has been used primarily for point-to-point and mostly non-interactive communications. However, this is quickly changing, as evidenced by
the growing online gaming industry. Online gaming and the algorithms that imple-
ment these games at the client, server, and network are discussed in Chapter 17.
Online gaming has opened up the possibility of exploring more real-time communi-
cations such as Telepresence on the Internet. It is expected that the next generation of communication services will leverage this interactivity to make communications more content-rich. Social networking is one example of next-generation communi-
cations. Using online social networks, communities are being built across the world
based on user group interests. Algorithms that attempt to describe the building and
evolution of these online social networks are discussed in Chapter 18. These social
networks also have the potential to evolve into interactive communication groups; therefore, understanding their evolution patterns is critical from a network operations
and management perspective.
In this book we have attempted to provide a flavor of how algorithms have formed
the basis of the Internet as we know it today. It is our hope that this book will provide
a useful overview of algorithms applied to communication networks, for any student
who aspires to do research in network architecture as well as the application of
algorithms to communication networks. We believe that for a robust design of the
future Internet, it is essential that the architecture be founded on the basis of sound
algorithmic principles.
We first thank the authors who contributed directly to this book by their willingness
to contribute a chapter in their areas of expertise. We would like to acknowledge
the support of our respective managements at AT&T and Alcatel-Lucent Bell Labs,
especially Divesh Srivastava at AT&T and T.V. Lakshman at Bell Labs. This book
grew out of a workshop held at the DIMACS center in Rutgers University in 2007.
We thank the DIMACS staff and leadership for their help in starting this project. We
would like to thank Simon Rees and his team at Springer for their efforts to publish
and market this book. Lastly, we thank all those who helped us by reviewing drafts
of chapters and providing feedback to the authors to help them revise and improve
their contributions. The reviewers include:
Graham Cormode
Marina Thottan
List of Contributors
David Amzallag BT
Grenville Armitage Swinburne University of Technology
Thomas Bengtsson Bell Labs
Li-Wei Chen Vanu
Yan Chen Northwestern University
Mung Chiang Princeton University
Chen-Nee Chuah UC Davis
Mads Dam KTH Royal Institute of Technology
Yanlei Diao University of Massachusetts
Debora Donato Yahoo! Research
Constantine Elster Qualcomm Israel
Aristides Gionis Yahoo! Research
Jiayue He Princeton University
Tin Kam Ho Bell Labs
Enrique Hernandez-Valencia Bell Labs
Chuanyi Ji Georgia Institute of Technology
Randy H. Katz UC Berkeley
Ram Keralapura Narus
Adam Kirsch Harvard University
V.S. Anil Kumar Virginia Tech
T.V. Lakshman Bell Labs
David Liben-Nowell Carleton College
Guanglei Liu Roane State Community College
Chapter 1
Design for Optimizability: Traffic Management of a Future Internet

1.1 Introduction
In this paper, we argue that the difficulty of solving the key optimization prob-
lems is an indication that we may need to revise the underlying protocols, or even
the architectures, that lead to these problem formulations in the first place. We ad-
vocate the design of optimizable networks – network architectures and protocols
that lead to easy-to-solve optimization problems, and consequently, optimal solu-
tions. Indeed, the key difference between “network optimization” and “optimizable
networks” is that the former refers to solving a given problem (induced by the ex-
isting protocols and architectures) while the latter involves formulating the “right”
problem (by changing protocols or architectures accordingly).
The changes to protocols and architectures can range from minor extensions to
clean-slate designs. In general, the more freedom we have to make changes, the eas-
ier it would be to create an optimizable network. On the other hand, the resulting
improvements in network management must be balanced against other considera-
tions such as scalability and extensibility, and must be made judiciously. To make
design decisions, it is essential to quantify the trade-off between making network
management problems easier by changing the problem statement and the extra over-
head the resulting protocol imposes on the network.
Network optimization has had a particularly large impact in the area of traffic
management, which controls the flow of traffic through the network. Today, this
spans across congestion control, routing, and traffic engineering. We start by intro-
ducing the notation used in this paper in Table 1.1. In Section 1.2, we describe how
optimization is used in traffic management today. In Section 1.3, we illustrate de-
sign principles which we have uncovered through our own research experiences on
traffic management. Traffic management is an extremely active area of research, but
we will not address related work in this paper since these examples are included to
serve as illustrations of general principles. In Section 1.4, we discuss other aspects
of traffic management, such as interdomain routing and active queue management,
where the problems are even more challenging. We also examine the trade-off be-
tween performance achieved and overhead imposed when designing optimizable
protocols. We conclude and point to future work in Section 1.5.
Inside a single AS, each router is configured with an integer weight on each of its
outgoing links, as shown in Figure 1.2. The routers flood the link weights throughout
the network and compute shortest paths as the sum of the weights. For example,
i directs traffic to k through the links with weights (2, 1, 5). Each router uses this information to construct a table that drives the forwarding of each IP packet to the next hop in its path to the destination.

Fig. 1.1 Components of the route optimization framework (measurement, routing model, optimization, and control applied to the operational network)

These protocols view the network inside an AS as a graph where each router is a node $n \in N$ and each directed edge is a link $l \in L$ between two routers. Each unidirectional link has a fixed capacity $c_l$, as well as a configurable weight $w_l$. The outcome of the shortest-path computation can be represented as $r_l^{(i,j)}$: the proportion of the traffic from router i to router j that traverses the link $l$.
Operators set the link weights in intradomain routing protocols in a process called traffic engineering. The selection of the link weights $w_l$ should depend on the offered traffic, as captured by a demand matrix whose entries $x^{(i,j)}$ represent the rate of traffic entering at router i that is destined to router j. The traffic matrix can be computed based on traffic measurements [4] or may represent explicit subscriptions or reservations from users. Given the traffic demand $x^{(i,j)}$ and link weights $w_l$, the volume of traffic on each link $l$ is $y_l = \sum_{i,j} x^{(i,j)} r_l^{(i,j)}$, the proportion of traffic that traverses link $l$ summed over all ingress–egress pairs. An objective function can quantify the "goodness" of a particular setting of the link weights. For traffic engineering, the optimization considers a network-wide objective of minimizing $\sum_l f(y_l/c_l)$. The traffic engineering penalty function $f$ is a convex, non-decreasing, and twice-differentiable function that gives an increasingly heavy penalty as link load increases, such as an exponential function. The problem traffic engineering solves is to set link weights to minimize $\sum_l f(y_l/c_l)$, assuming the weights are used for shortest-path routing.
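To make the formulation concrete, the sketch below (a hypothetical four-router topology and demand, my own helper names, and a single shortest path per pair rather than even splitting over ties) computes the link loads $y_l$ and the network-wide penalty $\sum_l f(y_l/c_l)$ for a given setting of the link weights.

```python
import heapq
from collections import defaultdict

def shortest_path(links, weights, src, dst):
    """Dijkstra over directed links; returns the list of links on one shortest path."""
    adj = defaultdict(list)
    for (u, v) in links:
        adj[u].append(v)
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v in adj[u]:
            nd = d + weights[(u, v)]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [], dst
    while node != src:
        path.append((prev[node], node))
        node = prev[node]
    return list(reversed(path))

def te_objective(links, weights, demands, f):
    """Link loads y_l and the network-wide penalty sum_l f(y_l / c_l)."""
    load = defaultdict(float)
    for (i, j), x in demands.items():
        for l in shortest_path(links, weights, i, j):
            load[l] += x
    return dict(load), sum(f(load[l] / c) for l, c in links.items())

# Hypothetical example: link capacities, configurable weights w_l, and demand x^(i,j).
links = {("a", "b"): 10, ("b", "c"): 10, ("a", "d"): 10, ("d", "c"): 10}
weights = {("a", "b"): 2, ("b", "c"): 1, ("a", "d"): 2, ("d", "c"): 5}
demands = {("a", "c"): 6.0}
penalty = lambda u: u ** 4   # a convex, increasing penalty on link utilization
print(te_objective(links, weights, demands, penalty))
```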
So far, we have covered the impact of link weights inside an AS. When a net-
work, such as an Internet service provider (ISP) backbone, can reach a destination
through multiple egress points, a routing change inside the AS may change how
traffic leaves the AS. Each router typically selects the closest egress point out of a
set of egress points which can reach a destination, in terms of the intradomain link
weights $w_l$, in a practice known as early-exit or hot-potato routing [5]. In the exam-
ple in Figure 1.3, suppose a destination is reachable via egress points in New York
City and San Francisco. Then traffic from Dallas exits via New York City rather
than San Francisco since the intradomain path cost from Dallas to New York City
is smaller. If the traffic from Dallas encounters congestion along the downstream
path from New York City in Figure 1.3, the network operators could tune the link
weights to make the path through San Francisco appear more attractive. Controlling
where packets leave the network, and preventing large shifts from one egress point to
another, is an important part of engineering the flow of traffic in the network. Models
can capture the effects of changing the link weights on the intradomain paths and
the egress points, but identifying good settings of the weights is very difficult.
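As a minimal illustration of the hot-potato rule (the numeric path costs here are made up, not taken from Figure 1.3), each ingress simply picks the egress point with the smallest intradomain path cost, so tuning link weights shifts the choice:

```python
def hot_potato_egress(intradomain_cost):
    """Pick the egress point with the smallest intradomain path cost (early-exit routing)."""
    return min(intradomain_cost, key=intradomain_cost.get)

# Illustrative costs from Dallas to the two egress points that can reach the destination.
costs = {"New York City": 9, "San Francisco": 12}
print(hot_potato_egress(costs))               # 'New York City'

# After tuning link weights to make San Francisco more attractive, the egress shifts.
costs_after_tuning = {"New York City": 14, "San Francisco": 12}
print(hot_potato_egress(costs_after_tuning))  # 'San Francisco'
```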
Traffic management today has several strengths. First, routing depends on a very
small amount of state per link, i.e., link weights. In addition, forwarding is done
hop-by-hop, so that each router decides independently how to forward traffic on its
outgoing links. Second, routers only disseminate information when link weights or
topology change. Also, TCP congestion control is based only on implicit feedback
of packet loss and delay, rather than explicit messages from the network. Third, the
selection of link weights can depend on a wide variety of performance and reliabil-
ity constraints. Fourth, hot-potato routing reduces internal resource usage (by using
the closest egress point), adapts automatically to changes in link weights, and al-
lows routers in the AS to do hop-by-hop forwarding towards the egress point. Last
but not least, the decoupling of congestion control and traffic engineering reduces
complexity through separation of concerns.
On the other hand, today’s protocols also have a few shortcomings. To start with,
optimizing the link weights in shortest-path routing protocols based on the traffic
matrix is NP-hard, even for the simplest of objective functions [6]. In practice, local-
search techniques are used for selecting link weights [6]; however, the computation
time is long and, while the solutions are frequently good [6], the deviation from the
optimal solution can be large. Finding link weights which work well for egress point
selection is even more challenging, as this adds even more constraints on how the
weights are set.
There are other limitations to today’s traffic management. The network opera-
tor can only indirectly influence how the routers forward traffic, through the setting
of the link weights. Further, traffic engineering is performed assuming that the of-
fered traffic is inelastic. In reality, end hosts adapt their sending rates to network
congestion, and network operators adapt the routing based on measurements of the
traffic matrix. Although congestion control and routing operate independently, their
decisions are coupled. The joint system is stable, but often suboptimal [7]. Further-
more, traffic engineering does not necessarily adapt on a small enough timescale
to respond to shifts in user demands. In addition to the choice of timescale, there are also choices as to which parts of the traffic management work should be carried out inside the network, and which by the sources. These limitations suggest
that revisiting architectural decisions is a worthy research direction.
Some optimization problems involve integer constraints, which are not convex,
making them intractable and their solutions suboptimal. Relaxing the integer con-
Fig. 1.6 Routers forwarding traffic with exponentially diminishing proportions of the traffic
directed to the longer paths. Arrows indicate paths that make forward progress towards the des-
tination, and the thickness of these lines indicates the proportion of the traffic that traverses these
edges
where $w_k^{(i,j)}$ is the sum of the link weights on the kth path between router i and j. So, as in OSPF and IS–IS today, each router would compute all the path weights for getting from i to j; there is just an extra step to compute the splitting ratios. For example, in Figure 1.6, consider the two lower paths of costs 8 (i.e., 2 + 1 + 5) and 9 (i.e., 2 + 4 + 3), respectively. The path with cost 8 will get $e^{-8}/(e^{-8} + e^{-9})$ of the traffic, and the path with cost 9 will get $e^{-9}/(e^{-8} + e^{-9})$ of the traffic.
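A minimal sketch of this splitting rule, using the path costs from the example above (the function name is my own): each path between i and j receives a share of traffic proportional to $e^{-w_k^{(i,j)}}$, so longer paths receive exponentially less.

```python
import math

def exp_splitting_ratios(path_costs):
    """Split traffic over paths in proportion to exp(-path cost)."""
    shares = [math.exp(-w) for w in path_costs]
    total = sum(shares)
    return [s / total for s in shares]

# The two lower paths in Figure 1.6: costs 8 (= 2 + 1 + 5) and 9 (= 2 + 4 + 3).
print(exp_splitting_ratios([8, 9]))  # [~0.731, ~0.269]
```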
Under this formulation, both link weights and the flow splitting ratios are vari-
ables. This enlarges the constraint set, and the resulting constraints are much easier
to approximate with convex constraints. Consequently, the link-weight tuning prob-
lem is tractable, i.e., can be solved much faster than the local search heuristics today.
In addition, the modified protocol is optimal, i.e., makes the most efficient use of
link capacities, and is more robust to small changes in the path costs. The optimality
result is unique to this particular problem where there is an intersection between
the set of optimal protocols and protocols based on link weights. In general, the
optimality gap is reduced by enlarging the constraint set, as seen in a similar pro-
posed extension to OSPF and IS–IS [9]. By changing the constraint set, [8, 9] retain
the simplicity of link-state routing protocols and hop-by-hop forwarding, while in-
ducing an optimization problem that is both faster to solve and leads to a smaller
optimality gap.
While adapting automatically to network changes, the metric $q_d(i,j)$ includes both configurable parameters and values computed directly from a real-time view of the topology. In particular, $q_d(i,j) = \alpha_d(i,j)\, w(i,j) + \beta_d(i,j)$, where $\alpha$ and $\beta$ are configurable values [10]. The first component of the equation supports automatic adaptation to topology changes, whereas the second represents a static ranking of egress points per ingress router. Providing separate parameters for each destination prefix allows even greater flexibility, such as allowing delay-sensitive traffic to use the closest egress point while preventing unintentional shifts in the egress points for other traffic.
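The sketch below illustrates this style of egress selection (the parameter values are illustrative, not taken from [10]): each candidate egress is ranked by $\alpha \cdot w + \beta$, so the static term can keep traffic pinned to an egress point across small cost changes while still allowing a switch after larger failures.

```python
def select_egress(path_cost, alpha, beta):
    """Rank egress points by q = alpha * w + beta and pick the smallest; alpha weights the
    real-time intradomain path cost, beta encodes a static, per-destination preference."""
    q = {e: alpha[e] * path_cost[e] + beta[e] for e in path_cost}
    return min(q, key=q.get)

path_cost = {"a": 9, "b": 11}    # intradomain costs from ingress c (as in Figure 1.7)
alpha = {"a": 1, "b": 1}
beta = {"a": 0, "b": 5}          # static bias keeps traffic on egress a for small cost changes
print(select_egress(path_cost, alpha, beta))           # 'a'
print(select_egress({"a": 11, "b": 11}, alpha, beta))  # still 'a' after the cost-9 path fails
print(select_egress({"a": 20, "b": 11}, alpha, beta))  # 'b' once only the cost-20 path remains
```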
Consider a scenario where $\alpha$ and $\beta$ are tuned to handle failure scenarios. As seen in Figure 1.7, the ingress router c can reach a destination through egress routers a and b. There are three paths from c to a, with path costs 9, 11, and 20, respectively; the path cost from c to b is 11. The goal is to not switch the traffic away from egress router a if the path with cost 9 fails, but to switch to egress b if the path with cost 11 also fails. This can be expressed as a set of conditions as in the following equation:
convexity issue. That is, the optimization problem becomes solvable in polynomial
time if we allow an ingress point i to split traffic destined to d over multiple egress
points e, rather than forcing all traffic from i to go to a single egress point; in prac-
tice, solving the relaxed problem produces integer solutions that do, in fact, direct
all traffic from i to d via a single egress point e. Overall, by increasing the degrees
of freedom, a management system can set the new parameters under a variety of
constraints that reflect the operators’ goals for the network [10]. Not only does the
network become easier to optimize, but the performance improves as well, due to
the extra flexibility in controlling where the traffic flows.
not care too much whether a link is 20% loaded or 40% loaded [6]. One way to
combine the objectives of traffic engineering and congestion control is to construct
a weighted sum of utility and link cost functions as the overall objective for traffic
management [12], where v is the weight between the two objectives:
$$\text{Maximize} \quad \sum_{i,j} U(x^{(i,j)}) - v \sum_l f\Big(\sum_{i,j} x^{(i,j)} r_l^{(i,j)} / c_l\Big) \qquad (1.4)$$
$$\text{Subject to} \quad \sum_{i,j} x^{(i,j)} r_l^{(i,j)} \le c_l, \quad x \ge 0.$$
In [12], we revisit the division of labor between users, operators, and routers. In this case, we allow for a per-path multicommodity flow solution, hence resulting in a convex problem, which opens up many standard optimization techniques that derive distributed and iterative solutions. In its current form, (1.4) has a non-convex constraint set, which can be transformed into a convex set if the routing is allowed to be multipath. To capture multipath routing, we introduce $x_k^{(i,j)}$ to represent the sending rate of router i to router j on the kth path. We also represent available paths by a matrix H where
$$H_{l,k}^{(i,j)} = \begin{cases} 1, & \text{if path } k \text{ of pair } (i,j) \text{ uses link } l \\ 0, & \text{otherwise.} \end{cases}$$
H does not necessarily contain all possible paths in the physical topology, only a subset of paths chosen by operators or the routing protocol. Using the new notation, the capacity constraint is transformed into $\sum_{i,j,k} x_k^{(i,j)} H_{l,k}^{(i,j)} \le c_l$, which is convex.
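As a small illustration (toy topology and my own variable names), the sketch below represents H as per-path link lists and checks the relaxed capacity constraint $\sum_{i,j,k} x_k^{(i,j)} H_{l,k}^{(i,j)} \le c_l$.

```python
# Paths are given as lists of links; path k of pair (i, j) uses link l iff l appears in paths[(i, j)][k].
paths = {
    ("a", "c"): [
        [("a", "b"), ("b", "c")],   # path k = 0
        [("a", "d"), ("d", "c")],   # path k = 1
    ]
}
capacity = {("a", "b"): 10, ("b", "c"): 10, ("a", "d"): 10, ("d", "c"): 10}
x = {("a", "c"): [4.0, 3.0]}        # per-path sending rates x_k^(i,j)

def link_loads(paths, x):
    load = {}
    for pair, pair_paths in paths.items():
        for k, links in enumerate(pair_paths):
            for l in links:
                load[l] = load.get(l, 0.0) + x[pair][k]
    return load

load = link_loads(paths, x)
feasible = all(load.get(l, 0.0) <= c for l, c in capacity.items())
print(load, feasible)
```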
Decomposition is the process of breaking up a single optimization problem into multiple ones that can be solved independently. As seen in Figure 1.8, by decomposing the overall traffic management optimization problem, a distributed protocol is derived that splits traffic over multiple paths, where the splitting proportions depend on feedback from the links. The links send feedback to the edge routers in the form of a price s that indicates the local congestion level, based on local link load information. Although there are multiple ways to decompose the optimization problem, they all lead to a similar division of functions between the routers and the links [12].
Fig. 1.8 A high-level view of how the distributed traffic-management protocol works
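The sketch below is a deliberately simplified caricature of such a decomposition-based protocol, not the actual algorithm derived in [12]: links raise a congestion price when their load nears capacity, and edge routers gradually shift traffic toward the path with the smallest total price.

```python
def run_distributed_tm(paths, capacity, demand, steps=200, price_step=0.05, shift=0.1):
    """Toy price/splitting iteration: links compute a local congestion price from their own
    load, and each ingress moves a small fraction of its traffic to its cheapest path."""
    split = {pair: [1.0 / len(p)] * len(p) for pair, p in paths.items()}  # start with an even split
    price = {l: 0.0 for l in capacity}
    for _ in range(steps):
        # Link loads implied by the current splitting ratios.
        load = {l: 0.0 for l in capacity}
        for pair, pair_paths in paths.items():
            for k, links in enumerate(pair_paths):
                for l in links:
                    load[l] += demand[pair] * split[pair][k]
        # Link feedback: price grows when load exceeds capacity, decays otherwise.
        for l in capacity:
            price[l] = max(0.0, price[l] + price_step * (load[l] - capacity[l]))
        # Edge routers: shift a small fraction of traffic toward the cheapest path.
        for pair, pair_paths in paths.items():
            path_price = [sum(price[l] for l in links) for links in pair_paths]
            best = path_price.index(min(path_price))
            for k in range(len(pair_paths)):
                if k != best:
                    moved = shift * split[pair][k]
                    split[pair][k] -= moved
                    split[pair][best] += moved
    return split, price

paths = {("a", "c"): [[("a", "b"), ("b", "c")], [("a", "d"), ("d", "c")]]}
capacity = {("a", "b"): 5, ("b", "c"): 5, ("a", "d"): 5, ("d", "c"): 5}
demand = {("a", "c"): 8.0}
print(run_distributed_tm(paths, capacity, demand))
```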
The principles introduced in the previous section are a useful first step towards de-
signing optimizable protocols, but are by no means comprehensive. The merits of
proposed optimizable protocols should always be balanced with any extra overhead
in practical implementation and robustness to changing network dynamics. In addi-
tion, the principles introduced in the previous section focus on intradomain traffic
management, and do not address all the challenges in end-to-end traffic manage-
ment. Finally, when deriving new architectures, the balance between performance
and other factors is even more delicate.
available. For striking the right trade-offs in the design of optimizable networks, it is
important to find effective ways to quantify the acceptable amount of deviation from
the optimal solution. There are also well-established, quantitative measures of the
notions of how easily solvable an optimization is. These quantitative measures can
help determine how much the protocols and architectures need to change to better
support network management.
The protocols today are designed with certain assumptions in mind, e.g., single-
path routing and hop-by-hop forwarding. Some of these assumptions cause the
resulting optimization problem to be intractable, e.g., single-path routing, while oth-
ers do not, e.g., hop-by-hop forwarding. By perturbing the underlying assumptions
in today’s protocols, we can achieve a different point in the trade-off space of op-
timality versus simplicity. Therefore, it is worth exploring the alternatives, even if
at the end the decision is to keep the original protocol and architectures. In order to
choose between protocol designs, the key is to gain a deeper understanding of the
trade-offs. As such, we believe that design for optimizability can be a promising,
new interdisciplinary area between the systems and theory communities.
Our examples thus far focused on optimization problems in intradomain traffic man-
agement. Routing within a single domain side-steps several important issues that
arise in other aspects of data networking, for several reasons:
– A single domain has the authority to collect measurement data (such as the traffic and performance statistics) and tune the protocol configuration (such as the link weights).
– The routing configuration changes on the timescale of hours or days, allowing ample time to apply more computationally intensive solution techniques.
– The optimization problems consider highly aggregated information, such as link-level performance statistics or offered load between pairs of routers.
When these assumptions do not hold, the resulting optimization problems become
even more complicated, as illustrated by the following two examples.
Optimization in interdomain traffic management: In the Internet, there are often
multiple Autonomous Systems (AS) in the path between the sender and the receiver.
Each AS does not have full view of the topology, only the paths which are made vis-
ible to it through the routing-protocol messages exchanged in the Border Gateway
Protocol (BGP). In addition, each AS has a set of private policies that reflect its
business relationships with other ASes. Without full visibility and control, it is diffi-
cult to perform interdomain traffic management. For example, to implement DATE
in the Internet, the ASes would need to agree to provide explicit feedback from the
links to the end hosts or edge routers, and trust that the feedback is an honest re-
flection of network conditions. Extending BGP to allow for multiple paths would
simplify the underlying optimization problem, but identifying the right incentives
for ASes to deploy a multipath extension to BGP remains an open question.
Optimization in active queue management: A router may apply active queue
management schemes like Random Early Detection [13] to provide TCP senders
with early feedback about impending congestion. RED has many configurable pa-
rameters to be selected by network operators, e.g., queue-length thresholds and
maximum drop probability. Unfortunately, predictive models for how the tunable
parameters affect RED’s behavior remain elusive. In addition, the appropriate pa-
rameter values may depend on a number of factors, including the number of active
data transfers and the distribution of round-trip times, which are difficult to measure
on high-speed links. Recent analytic work demonstrates that setting RED param-
eters to stabilize TCP is fundamentally difficult [14]. It is appealing to explore
alternative active-queue management schemes that are easier to optimize, includ-
ing self-tuning algorithms that do not require the network management system to
adjust any parameters.
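For reference, the basic RED drop-probability rule is simple to state; the sketch below is a simplified version (it omits details such as the count-based probability adjustment in [13]), and it makes explicit which tunable parameters the operator must choose.

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Simplified RED: no drops below min_th, forced drop above max_th,
    and a linearly increasing drop probability in between."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)

def update_avg_queue(avg, sample, weight=0.002):
    """Exponentially weighted moving average of the instantaneous queue length."""
    return (1 - weight) * avg + weight * sample

print(red_drop_probability(avg_queue=30, min_th=20, max_th=60, max_p=0.1))  # 0.025
```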
From these two examples, it is clear that there remain open challenges in end-
to-end traffic management. Outside the context of traffic management, network
optimization’s role is even less understood. We argue for a principled approach in
tackling these challenges so that, in time, protocol design can be less of an art and
more of a science.
The challenges are not just limited to protocols, but extend to architectural decisions
regarding the placement of functionality. Architecturally, the DATE example repre-
sents one extreme where most of computation and coordination is moved into the
distributed protocols that run in the routers. In the context of Figure 1.1, this means
much of the measurement, control, and optimization is pushed down into the net-
work. One can consider another extreme, where the network management systems
bear all the responsibility for adapting to changes in network conditions, as in [15].
Both approaches redefine the division of labor between the management system
and the routers, where one moves most of the control into the distributed protocols
and the other has the management systems directly specify how the routers handle
packets.
In some cases, having the management system bear more responsibility would
be a natural choice, for example, where an optimization problem is fundamentally difficult, consequently leading to distributed solutions that are complicated, suboptimal, or both. Unlike the routers, a management system has the luxury of a
global view of network conditions and the ability to run centralized algorithms for
computing the protocol parameters. Today’s traffic engineering uses the centralized
approach and allows operators to tailor the objectives to the administrative goals
of the network. This leads to a more evolvable system, where the objective func-
tion and constraints can differ from one network to another, and change over time.
In addition, the operators can capitalize on new advances in techniques for solving
the optimization problems, providing an immediate outlet for promising research
results.
The network management system can apply centralized algorithms based on a
global view of network conditions, at the expense of a slower response based on
coarse-grain measurements. Yet some parts of traffic management, such as detecting
link failures and traffic shifts, must occur in real time. In order to understand which
functions must reside in the routers to enable adaptation on a sufficiently small time-
scale, it is important to quantify the loss in performance due to slower adaptation.
For functions which require fast adaptation, an architecture where end users load-balance across multiple paths would be desirable. For functions that can operate on a
slower timescale, the control of flow distribution can be left to operators. In general,
determining the appropriate division of labor between the network elements and the
management systems is an avenue for future research.
References
1. F. P. Kelly, A. Maulloo, and D. Tan, “Rate control for communication networks: Shadow prices,
proportional fairness and stability,” J. of Operational Research Society, vol. 49, pp. 237–252,
March 1998.
2. S. H. Low, “A duality model of TCP and queue management algorithms,” IEEE/ACM Trans.
Networking, vol. 11, pp. 525–536, August 2003.
3. R. Srikant, The Mathematics of Internet Congestion Control. Birkhauser, 2004.
4. M. Grossglauser and J. Rexford, “Passive traffic measurement for IP operations,” in The Inter-
net as a Large-Scale Complex System, pp. 91–120, Oxford University Press, 2005.
5. R. Teixeira, A. Shaikh, T. Griffin, and J. Rexford, “Dynamics of hot-potato routing in IP net-
works,” in Proc. ACM SIGMETRICS, June 2004.
6. B. Fortz and M. Thorup, “Increasing Internet capacity using local search,” Computational Op-
timization and Applications, vol. 29, no. 1, pp. 13–48, 2004.
7. J. He, M. Bresler, M. Chiang, and J. Rexford, “Towards multi-layer traffic engineering: Opti-
mization of congestion control and routing,” IEEE J. on Selected Areas in Communications,
June 2007.
8. D. Xu, M. Chiang, and J. Rexford, “Link-state routing with hop-by-hop forwarding can achieve
optimal traffic engineering,” in Proc. IEEE INFOCOM, May 2008.
9. D. Xu, M. Chiang, and J. Rexford, “DEFT: Distributed exponentially-weighted flow splitting,”
in Proc. IEEE INFOCOM, May 2007.
10. R. Teixeira, T. Griffin, M. Resende, and J. Rexford, “TIE breaking: Tunable interdomain egress
selection,” IEEE/ACM Trans. Networking, August 2007.
11. A. Ozdaglar and D. P. Bertsekas, “Optimal solution of integer multicommodity flow problems
with application in optical networks,” Frontiers in Global Optimization, vol. 74, pp. 411–435,
2004.
12. J. He, M. Bresler, M. Chiang, and J. Rexford, “Rethinking Internet traffic management: From
multiple decompositions to a practical protocol,” in Proc. CoNEXT, December 2007.
13. S. Floyd and V. Jacobson, “Random early detection gateways for congestion avoidance,”
IEEE/ACM Trans. Networking, vol. 1, pp. 397–413, August 1993.
14. S. H. Low, F. Paganini, J. Wang, and J. C. Doyle, “Linear stability of TCP/RED and a scalable
control,” Computer Networks, vol. 43, pp. 633–647, December 2003.
15. A. Greenberg, G. Hjalmtysson, D. A. Maltz, A. Meyers, J. Rexford, G. Xie, H. Yan, J. Zhan,
and H. Zhang, “A clean slate 4D approach to network control and management,” ACM SIG-
COMM Computer Communication Review, October 2005.
Chapter 2
Valiant Load-Balancing: Building Networks
That Can Support All Traffic Matrices
Rui Zhang-Shen
Abstract This paper is a brief survey on how Valiant load-balancing (VLB) can be
used to build networks that can efficiently and reliably support all traffic matrices.
We discuss how to extend VLB to networks with heterogeneous capacities, how
to protect against failures in a VLB network, and how to interconnect two VLB
networks. For the readers’ reference, included also is a list of work that uses VLB
in various aspects of networking.
2.1 Introduction
In many networks the traffic matrix is either hard to measure and predict, or highly
variable over time. In these cases, using Valiant load-balancing (VLB) to support
all possible traffic matrices is an attractive option. For example, even though the
traffic in the Internet backbone is extremely smooth due to the high level of aggregation, it is still hard to measure. Accurately measuring the traffic matrix (e.g.,
using NetFlow) is too expensive to do all the time, and standard methods using
link measurements give errors of 20% or more. Even if the current traffic matrix is
satisfactorily obtained, extrapolating it to the future is fraught with uncertainty, due
to the unpredictable nature of Internet traffic growth. Finally, since Internet traffic is
dynamic, the traffic matrix can deviate from its normal values at any time, possibly
causing congestion.
The traffic demand seen by a network can be represented by the traffic matrix,
which indicates the rates at which each node initiates traffic to every other node.
We say a network can support a traffic matrix if for every link in the network, the
load caused by the traffic matrix is less than the capacity of the link. When a network
cannot support the traffic matrix presented to it, at least one link in the network has a
load higher than its capacity. Congestion occurs and backlog in the buffer builds up
on the congested link(s), causing packet drops, increased delay, and high variations
R. Zhang-Shen ()
Google, Inc., New York, NY 10011
e-mail: [email protected]
in delay. Ideally, we would like to design a network that can support a wide range
of traffic matrices, so that congestion occurs only rarely or not at all.
In this paper, we discuss the use of VLB in building networks that can efficiently
support all traffic matrices which do not over-subscribe any node. We first briefly
survey the wide use of VLB in various aspects of networking, and describe the
basic scenario of using VLB in a network. Section 2.2 extends VLB from a homo-
geneous setting to networks with arbitrary capacities, Section 2.3 describes how to
protect against and recover quickly from failures in a VLB network, and Section
2.4 proposes to use VLB to route traffic between two networks. Finally Section 2.5
discusses possible future work.
In the early 1980s, Valiant [19] first proposed the scheme of routing through a ran-
domly picked intermediate node en route to a packet’s destination. He showed that
in an N -node binary cube network, given any permutation traffic matrix, the dis-
tributed two-phase randomized routing can route every packet to its destination
within O(log N ) time with overwhelming probability. This was the first scheme
for routing an arbitrary permutation in a sparse network in O(log N) time. Since then, such randomized routing has been used widely, and is often referred to as Valiant load-balancing (VLB), randomized load-balancing, or two-phase routing. VLB has many good character-
istics. It is decentralized, where every node makes local decisions. This also makes
the scheme scalable. VLB is agnostic to the traffic matrix because the randomness
erases the traffic pattern, and different traffic matrices can result in the same load on
the links.
Soon after its invention, VLB was used in other interconnection networks for parallel
communication, to improve delivery time [1], and to relieve effects of adverse traffic
patterns [13]. In recent years, it was adapted for routing in torus networks [17, 18]
in order to provide worst-case performance guarantees without sacrificing average-
case performance. The key is to use VLB adaptively, based on the observation that
under low load, load-balancing only a small amount of traffic is sufficient to avoid
congestion.
VLB is also used in building network switches with great scalability and perfor-
mance guarantee, without the need of a centralized scheduler. It was used in ATM
switches [7], routers [4, 5], optical routers [3, 9], and software routers [2]. In partic-
ular, the scheme was rediscovered for designing router switch fabrics [4] to mitigate
routers’ scaling challenges, because it was difficult for centralized schemes to keep
up with the increasing link speed. In this context, it was shown that splitting traffic
in a round-robin fashion has the same effect on link load as random splitting [4], and that it is the most efficient in terms of the total required interconnection capacity for supporting all traffic matrices [8].
Almost simultaneously, several groups independently applied the idea of VLB to
traffic engineering and network design for the Internet, in order to efficiently sup-
port all possible traffic matrices. Kodialam et al.’s two-phase routing [11, 12] is a
traffic engineering method, where a full mesh of tunnels is set up over fixed capac-
ity links and packets are sent in two phases (i.e., two hops) in the network. Winzer
et al.’s selective randomized load-balancing [14, 16, 21] used VLB and its variants
to design cost-effective optical networks. Their model assumes that a link’s cost in-
cludes both the fiber and the terminating equipment, so there is incentive for having
fewer links. In the optimal design, traffic is load-balanced only to a few intermediate
mesh in a backbone network to support all traffic matrices and to quickly recover from failures. Zhang-Shen and McKeown [23, 25] proposed using VLB over a logical full
mesh in a backbone network to support all traffic matrices and to quickly recover
from failures. In addition, VLB was used as an optical routing strategy in Ether-
net LAN [20], for scheduling in metro area WDM rings [10], for circuit-switched
networks [22], and for scaling and commoditizing data center networks [6].
A study on the queueing properties of a VLB network [15] found that VLB
eliminates congestion in the network, and pseudo-random (e.g., round-robin) load-
balancing reduces queueing delay. VLB was also shown to eliminate congestion on
peering links when used to route traffic between networks [26].
Consider a network of N nodes, each with capacity r, i.e., a node can initiate traf-
fic at the maximum rate of r, and can receive traffic at the same maximum rate.
We assume that the network traffic satisfies such node aggregate constraint, be-
cause otherwise there is no way to avoid congestion. A logical link of capacity $2r/N$ is established between every pair of nodes over the physical links, as shown in Figure 2.1. We use the convention that a flow in the network is defined by the source node and the destination node, unless further specified. Every flow entering the network is equally split across N two-hop paths between ingress and egress nodes, i.e., a packet is forwarded twice in the network: In the first hop, an ingress node uniformly distributes each of its incoming flows to all the N nodes, regardless of the destinations.

[Fig. 2.1: N nodes of capacity r connected by a logical full mesh of links of capacity 2r/N]

In the second hop, all packets are sent to the final destinations by
the intermediate nodes. Load-balancing can be done packet-by-packet, or flow-by-
flow at the application flow level. The splitting of traffic can be random (e.g., to a
randomly picked intermediate node) or deterministic (e.g., round-robin).
Assume we can achieve perfect load-balancing, i.e., can split traffic at the exact proportions we desire; then each node receives exactly $1/N$ of every flow after first-hop routing. This means that all the N nodes equally share the burden of forwarding traffic as the intermediate node. When the intermediate node happens to be the ingress or egress node, the flow actually traverses one hop (the direct link between ingress and egress) in the network. Hence, $2/N$ of every flow traverses the corresponding one-hop path.
Such uniform load-balancing can guarantee to support all traffic matrices in this network. Since the incoming traffic rate to each node is at most r, and the traffic is evenly load-balanced to N nodes, the actual traffic on each link due to the first-hop routing is at most $r/N$. The second-hop routing is the dual of the first-hop routing. Since each node can receive traffic at a maximum rate of r and receives $1/N$ of the traffic from every node, the actual traffic on each link due to the second-hop routing is also at most $r/N$. Therefore, a full-mesh network where each link has capacity $2r/N$ is sufficient to support all traffic matrices in a network of N nodes of capacity r. This is perhaps a surprising result – a network where any two nodes are connected with a link of capacity $2r/N$ can support traffic matrices where a node can send traffic to another node at rate r. It shows the power of load-balancing. In VLB, each flow is carried by N paths, and each link carries a fraction of many flows; therefore any large flow is averaged out by other small flows. In a static full-mesh network, if all the traffic were to be sent through direct paths, we would need a full-mesh network of link capacity r to support all possible traffic matrices; therefore load-balancing is $N/2$ times more efficient than direct routing.
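A small simulation sketch (random but admissible traffic matrix, my own helper names) makes this concrete: uniform two-hop load-balancing keeps every link load below $2r/N$.

```python
import random

def vlb_link_loads(traffic, N):
    """Uniform two-hop load-balancing: 1/N of flow (s, d) crosses links (s, m) and (m, d)
    for every intermediate node m (only one hop when m == s or m == d)."""
    load = {(i, j): 0.0 for i in range(N) for j in range(N) if i != j}
    for s in range(N):
        for d in range(N):
            if s == d:
                continue
            for m in range(N):
                share = traffic[s][d] / N
                if m != s:
                    load[(s, m)] += share
                if m != d:
                    load[(m, d)] += share
    return load

N, r = 10, 1.0
# Random traffic matrix; rows sum to r/2 so that row and column sums stay comfortably below r.
traffic = [[0.0] * N for _ in range(N)]
for s in range(N):
    others = [d for d in range(N) if d != s]
    split = [random.random() for _ in others]
    total = sum(split)
    for d, w in zip(others, split):
        traffic[s][d] = r * w / total / 2

loads = vlb_link_loads(traffic, N)
print(max(loads.values()), "<=", 2 * r / N)
```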
Real-life networks are often heterogeneous, i.e., the nodes in a network can have
different capacities. In this section we discuss how to extend the result of uniform
load-balancing to heterogeneous networks [24].
We first introduce notation. In a network of N nodes, the traffic matrix $\Lambda = \{\lambda_{ij}\}$ is an $N \times N$ matrix, where the entry $\lambda_{ij}$ indicates the data rate at which Node i initiates traffic destined to Node j. The traffic rate is typically averaged over a long period of time, so we consider it constant. Typically in a network there are buffers to absorb short-lived traffic fluctuations. Suppose Node i has capacity $r_i$, i.e., the node can initiate traffic at the maximum rate of $r_i$, and can receive traffic at the same maximum rate. So in order for the traffic matrix to not over-subscribe any node, it must satisfy
$$\sum_j \lambda_{ij} \le r_i, \;\; \forall i \quad \text{and} \quad \sum_j \lambda_{ji} \le r_i, \;\; \forall i. \qquad (2.1)$$
Without loss of generality, assume that the nodes have been sorted according to decreasing capacities, i.e., $r_1 \ge r_2 \ge \cdots \ge r_N$, so Node 1 is the largest node and Node N the smallest. Let R be the total node capacity, i.e., $R = \sum_{i=1}^{N} r_i$. We assume $r_1 \le \sum_{i=2}^{N} r_i$ because even if $r_1 > \sum_{i=2}^{N} r_i$, Node 1 cannot send or receive traffic at a rate higher than $\sum_{i=2}^{N} r_i$, because that would over-subscribe some nodes.
Suppose that a full mesh of logical links is set up to connect these N nodes. Let $c_{ij}$ represent the required link capacity from Node i to Node j, and C the link capacity matrix $\{c_{ij}\}$. Having $c_{ij} = 0$ means that link $(i, j)$ is not needed. The simple homogeneous network presented in Section 2.1.2 has $c_{ij} = 2r/N, \; \forall i \ne j$, and $r_i = r, \; \forall i$.
In a network with identical nodes, it is natural to load-balance uniformly. But uniform load-balancing seems too restrictive in a heterogeneous network because it does not take into account the difference in node capacities. A natural solution is to load-balance proportionally to the capacity of the intermediate node, i.e., Node i receives fraction $r_i/R$ of every flow. This is a direct generalization from uniform multicommodity flow in the homogeneous case to product multicommodity flow [16]. The required link capacity is $c_{ij} = 2 r_i r_j / R$ [24].
We can further generalize VLB to allow any flow splitting ratio and let some external objective determine the optimal ratios. We introduce a set of load-balancing parameters $p_i$ such that $p_i \ge 0$ for all $i$, and $\sum_{i=1}^{N} p_i = 1$. An ingress node splits each flow according to $\{p_i\}$ and sends $p_i$ of every flow to Node i. This gives us the freedom of, for example, letting the larger nodes forward more traffic than the smaller nodes, or not using some of the nodes as intermediates (by setting the corresponding $p_i$ to zero). If there are some objectives to be optimized, there are N parameters ($p_i$, $i = 1, 2, \ldots, N$) that can be tuned, but if more freedom is needed, we can, for example, let each flow have its own set of load-balancing parameters.
We now find the required link capacity. The first-hop traffic on link $(i, j)$ is the traffic initiated by Node i that is load-balanced to Node j, and the rate is at most $r_i p_j$. The second-hop traffic on the link is the traffic destined to Node j that is load-balanced to Node i, and the rate is at most $r_j p_i$. Therefore the maximum amount of traffic on link $(i, j)$ is $r_i p_j + r_j p_i$, which is also the required capacity on link $(i, j)$:
$$c_{ij} = r_i p_j + r_j p_i. \qquad (2.2)$$
The required (outgoing) interconnection capacity of Node i is
$$l_i = \sum_{j: j \ne i} c_{ij} = r_i + R p_i - 2 r_i p_i,$$
We can show that Equation (2.4) is also the minimum total interconnection capacity required by any network to support all traffic matrices [24], and hence is the necessary and sufficient condition. One network that minimizes the total required interconnection capacity is a "star" with Node 1 at the center, i.e., $p_1 = 1$, $p_i = 0$ for $i \ge 2$. In the case where all nodes have the same capacity, to achieve minimum L, the $p_i$ can take any non-negative values as long as they sum up to 1. Thus the uniform load-balancing presented in Section 2.1.2 is optimal in terms of total required interconnection capacity. Splitting flows proportional to node capacities, i.e., $p_i = r_i/R$, is not optimal when nodes have different capacities.
In order to use the minimum amount of interconnection capacity, only nodes with
the largest capacity can act as intermediate nodes. This can be limiting, especially if
only one node has the largest capacity, because a star network is not good for fault
tolerance. A star network is efficient but not balanced, because the center node acts
as the intermediate node for all traffic. We need a scheme that is not only efficient
but also balanced, and we found one by minimizing the network fanout.
The fanout of node i is $f_i = l_i / r_i$, the ratio of node i's interconnection capacity to its node capacity. Since the interconnection capacity is used both for sending traffic originated from the node and for forwarding traffic for other nodes, the fanout measures the amount of responsibility the node has to forward other nodes' traffic relative to its size. If the fanouts of two nodes are the same, then the larger node forwards more traffic, which is a desired property. Thus, to have a balanced network, we minimize the maximum fanout over all nodes, which results in all nodes having equal fanout. The resulting load-balancing parameters and fanout are
$$p_i = \frac{r_i/(R - 2r_i)}{\sum_k r_k/(R - 2r_k)}, \quad i = 1, 2, \ldots, N,$$
$$f_i = 1 + \frac{1}{\sum_{j=1}^{N} r_j/(R - 2r_j)}, \quad \forall i.$$
The optimal load-balancing parameters are almost proportional to the node capacities: $p_i \propto \frac{r_i}{R - 2r_i}$. The parameter $p_i$ is a strictly increasing function of $r_i$. Therefore a larger node has greater responsibility for forwarding traffic. We can further show that the total interconnection capacity used in this scheme is no more than $\frac{1}{2}(\sqrt{2} + 1) \approx 1.207$ times the minimum total capacity [24]. Thus, the scheme of minimizing maximum fanout is not only balanced, but also efficient.
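A small sketch of this scheme (hypothetical node capacities, assuming $r_1 < \sum_{i \ge 2} r_i$ so the formulas are well defined): compute the fanout-minimizing parameters $p_i$, the common fanout, and the resulting link capacities $c_{ij} = r_i p_j + r_j p_i$.

```python
def min_fanout_parameters(r):
    """Load-balancing parameters that equalize (and minimize the maximum) fanout:
    p_i proportional to r_i / (R - 2 r_i); requires r_1 < sum of the other capacities."""
    R = sum(r)
    raw = [ri / (R - 2 * ri) for ri in r]
    total = sum(raw)
    p = [x / total for x in raw]
    fanout = 1 + 1 / total          # the same for every node
    return p, fanout

def link_capacities(r, p):
    """Required capacity c_ij = r_i p_j + r_j p_i on each directed logical link."""
    N = len(r)
    return {(i, j): r[i] * p[j] + r[j] * p[i] for i in range(N) for j in range(N) if i != j}

r = [8.0, 5.0, 4.0, 3.0]            # heterogeneous node capacities, sorted in decreasing order
p, fanout = min_fanout_parameters(r)
c = link_capacities(r, p)
print(p, fanout)
print(sum(c.values()))              # total interconnection capacity used by this scheme
```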
One can of course choose to optimize other criteria that are more suitable for the
particular situation. Kodialam et al. [11] assume that the underlying physical links
have fixed capacities, constraining the logical link capacities. They give efficient
[Footnote 1] We focus on failures in the logical topology, and since several logical links can share a physical link, a physical failure can correspond to multiple logical failures.
[Footnote 2] If a node fails, we discard the traffic originating from or terminating at this node.
for supporting all traffic matrices. In general, when there are k node failures, the network becomes an $(N-k)$-node full mesh, so the link capacity required to tolerate k node failures is
$$C(N, k, 0) = C(N-k, 0, 0) = \frac{2r}{N-k}. \qquad (2.5)$$
Link failures are a little more complicated, as they can destroy the symmetry of the topology, so that it is no longer a full mesh. We only consider the worst-case failure scenarios (adversarial link failures) here. We omit the details and only give the final result. The amount of capacity required on each link to tolerate k arbitrary link failures, for $1 \le k \le N-2$, is
$$C(N, 0, k) = \begin{cases} \frac{r}{N-2} + \frac{r}{N}, & k = 1 \\ \frac{r}{N-k-1} + \frac{r}{N-1}, & k = 2 \text{ or } N-2, \text{ or } N \le 6 \\ \frac{2r}{N-k}, & \text{otherwise.} \end{cases} \qquad (2.6)$$
When there are both $k_n$ node failures and $k_l$ link failures,
$$C(N, k_n, k_l) \approx \frac{2r}{N - k_n - k_l} = \frac{2r}{N - k}, \qquad (2.7)$$
where $k = k_n + k_l$. This means that the curve for Equation (2.7) is roughly the same as that for Equation (2.6), shown in Figure 2.2. So we conclude that in a VLB network, a small amount of over-provisioning goes a long way to make the network fault tolerant.

Fig. 2.2 The required capacity of each link (in multiples of r) vs. the number of link failures k, in a 50-node network

For example, if the links in a 50-node network are over-provisioned
by just about 11%, the network can tolerate any five (node or link) failures.
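The arithmetic behind this example, using the approximation of Equation (2.7): relative to the no-failure design of $2r/N$ per link, tolerating k failures requires a factor of $N/(N-k)$.

```python
def overprovisioning_factor(N, k):
    """Extra link capacity (relative to the no-failure design 2r/N) needed to
    tolerate k combined node/link failures, using C ~ 2r/(N - k)."""
    return N / (N - k)

print(overprovisioning_factor(50, 5))   # ~1.11: about 11% over-provisioning tolerates any 5 failures
```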
Today, most congestion in Internet backbones takes place on the peering links connecting them. When a link between two networks is congested, very likely some
other peering links between these two networks are lightly loaded. That is, the peer-
ing links between two networks are usually not evenly utilized. We propose to use
a technique similar to VLB to route peering traffic, so as to eliminate congestion on
the peering links. We show that if traffic is load-balanced over all the peering links
between two networks, there will be no congestion as long as the total peering ca-
pacity is greater than the total peering traffic. Even though the two networks using
VLB to route their peering traffic do not need to use VLB internally, we assume
they do and analyze how the peering scheme affects how the networks run VLB.
Suppose two VLB networks are connected by a subset of their nodes (the peering
nodes), as shown in Figure 2.3. For ease of description, we use the same numbering for the peering nodes in both networks. (Note that this convention is different from Section 2.2.) The traffic exchanged between the two networks is called peering traffic, and we assume that the total amount is no more than $R_p$ in each direction.
In the network of N nodes, we introduce the peering load-balancing parameters $q_i$, $i = 1, 2, \ldots, N$, such that a portion $q_i$ of the peering traffic between the two networks is exchanged at node $i$. Naturally, $q_i = 0$ if $i$ is not a peering node. The peering load-balancing parameters, $q_i$, together with the maximum peering traffic between the two networks, $R_p$, determine the sizes of the peering links: the required capacity of the peering link at node $i$ is $R_p q_i$. Suppose the peering links
have the required capacities, then if the peering traffic is load-balanced across the
peering links according to the proportions qi , and the total amount of peering traffic
between the two networks does not exceed Rp , there will be no congestion on the
peering links.
Fig. 2.3 Two VLB networks connect at a set of peering nodes. The total amount of traffic exchanged between the two networks is no more than $R_p$, and a portion $q_i$ of the peering traffic is exchanged at node $i$.
The extra requirement of routing peering traffic may result in higher capacity
requirements inside the networks. If we treat the peering traffic that originates
from the network as traffic destined to the peering nodes, and the peering traffic
that enters the network as traffic originating from the peering nodes, then the peering
traffic may have to traverse two hops in each network.
Alternatively, we can load-balance peering traffic over the peering points only,
instead of all the nodes. Thus, we require that peering traffic traverses at most one
hop in each network, and at most two hops altogether.
Suppose the peering load-balancing parameters are fixed, for example, through
negotiation between the two networks. Then we can vary pi , i.e., how non-peering
traffic is routed internally, to minimize the required link capacities. We observe that $R_p$ is likely to be bigger than the node capacities $r_i$: $R_p$ is the total amount of traffic the two networks exchange and can be a large portion of the network's total traffic R, while the node capacities are likely to make up only a small fraction of R, on the order of R/N.
If we assume that $R_p \ge r_i$ for all $i$, then the minimum $c_{ij}$ is achieved for all links when $p_i = q_i$ for all $i$. So the optimal capacity allocation in a network with peering traffic is
$$c_{ij} = r_i q_j + r_j q_i. \qquad (2.8)$$
Since $q_i$ is zero if node $i$ is a non-peering node, $c_{ij} = 0$ if both node $i$ and node $j$ are non-peering nodes. The network is now a two-tiered one: in the center is a
full mesh connecting the peering nodes, and on the edge are the non-peering nodes,
each connected to all the center nodes.
Setting local load-balancing parameters to be the same as peering load-balancing
parameters means that only the peering nodes will serve as intermediate nodes to
forward traffic. Peering nodes are often the largest nodes in the network, so they
should have larger responsibilities in forwarding traffic. The analysis shows that the
optimal way is to only let the peering nodes forward traffic. This has the additional
benefits of requiring fewer links and reducing network complexity.
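The following short sketch (with illustrative capacities and peering parameters of our own choosing) computes the capacity allocation of Equation (2.8) and exhibits the resulting two-tier structure:

# Sketch: optimal link capacities c_ij = r_i q_j + r_j q_i (Eq. 2.8) when the
# local load-balancing parameters are set equal to the peering parameters q_i.
r = [10.0, 20.0, 30.0, 40.0]        # node capacities (illustrative)
q = [0.0, 0.0, 0.4, 0.6]            # peering parameters; nodes 3 and 4 are peering nodes

N = len(r)
c = [[r[i] * q[j] + r[j] * q[i] for j in range(N)] for i in range(N)]

for i in range(N):
    print([round(c[i][j], 1) for j in range(N)])   # the diagonal is not used
# Entries between the two non-peering nodes are zero: non-peering nodes connect
# only to the full mesh formed by the peering nodes.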
2.5 Discussions
Using VLB to route traffic in a network has many advantages, such as efficiency and fast failure recovery. There are some open questions as well. Some examples are:
- Sending packets through two hops may increase the propagation delay, while sending packets through the direct link may cause increased capacity requirements. There should be a tradeoff between packet delay and required capacity.
- Different paths that a flow traverses may have different delays, and hence packets in the flow may not arrive at the destination in the original order. This is usually not a problem in the Internet, since a flow may consist of many application-level flows, and can be split accordingly for load-balancing. But it can be a problem if a flow cannot be easily split.
- It is unclear whether VLB can be incrementally deployed in a network.
References
15. R. Prasad, P. Winzer, S. Borst, and M. Thottan. Queuing delays in randomized load balanced
networks. In Proc. IEEE INFOCOM, May 2007.
16. F. B. Shepherd and P. J. Winzer. Selective randomized load balancing and mesh networks with
changing demands. Journal of Optical Networking, 5:320–339, 2006.
17. A. Singh. Load-Balanced Routing in Interconnection Networks. PhD thesis, Department of
Electrical Engineering, Stanford University, 2005.
18. A. Singh, W. J. Dally, B. Towles, and A. K. Gupta. Locality-preserving randomized oblivious
routing on torus networks. In SPAA ’02: Proceedings of the fourteenth annual ACM symposium
on parallel algorithms and architectures, pages 9–13, 2002.
19. L. G. Valiant. A scheme for fast parallel communication. SIAM Journal on Computing,
11(2):350–361, 1982.
20. R. van Haalen, R. Malhotra, and A. de Heer. Optimized routing for providing Ethernet LAN
services. Communications Magazine, IEEE, 43(11):158–164, Nov. 2005.
21. P. J. Winzer, F. B. Shepherd, P. Oswald, and M. Zirngibl. Robust network design and selective
randomized load balancing. 31st European Conference on Optical Communication (ECOC),
1:23–24, September 2005.
22. R. Zhang-Shen, M. Kodialam, and T. V. Lakshman. Achieving bounded blocking in circuit-
switched networks. IEEE INFOCOM 2006, pages 1–9, April 2006.
23. R. Zhang-Shen and N. McKeown. Designing a Predictable Internet Backbone Network. In
HotNets III, November 2004.
24. R. Zhang-Shen and N. McKeown. Designing a predictable Internet backbone with Valiant
Load-Balancing. Thirteenth International Workshop on Quality of Service (IWQoS), 2005.
25. R. Zhang-Shen and N. McKeown. Designing a Fault-Tolerant Network Using Valiant Load-
Balancing. Proc. IEEE INFOCOM, pages 2360–2368, April 2008.
26. R. Zhang-Shen and N. McKeown. Guaranteeing Quality of Service to Peering Traffic. Proc.
IEEE INFOCOM, pages 1472–1480, April 2008.
Chapter 3
Geometric Capacity Provisioning
for Wavelength-Switched WDM Networks
3.1 Introduction
L.-W. Chen
Vanu Inc., One Cambridge Center, Cambridge, MA 02142
e-mail: [email protected]
E. Modiano ()
Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139
e-mail: [email protected]
Portions reprinted, with permission, from "A Geometric Approach to Capacity Provisioning in WDM Networks with Dynamic Traffic", 40th Annual Conference on Information Sciences and Systems, 2006. © IEEE.
Fig. 3.1 An example of a mesh optical network consisting of numerous nodes and links, followed by a shared-link model based on the link. The dotted lines denote different users of the link. Since each pair of input–output fibers comprises a different user, and there are four input fibers and four output fibers, there are a total of 4 × 4 = 16 users in this example
and destination fibers. Furthermore, this assignment can change over time as traffic
demands change. This obviously imparts a great deal of additional flexibility. The
downside is that the added switching and processing hardware makes it more ex-
pensive to dynamically provision wavelengths.
There has been much investigation of both statically provisioned and dynami-
cally provisioned systems in the literature [1–4]. Such approaches are well suited
for cases where either the traffic is known a priori and can be statically provisioned,
or is extremely unpredictable and needs to be dynamically provisioned. However,
in practice, due to statistical multiplexing, it is common to see traffic demands char-
acterized by a large mean and a small variance around that mean. A hybrid system
is well suited to such a scenario. In a hybrid system, a sufficient number of wave-
lengths are statically provisioned to support the majority of the traffic. Then, on top
of this, a smaller number of wavelengths are dynamically provisioned to support
the inevitable variation in the realized traffic. Such an approach takes advantage
of the relative predictability of the traffic by cheaply provisioning the majority of
the wavelengths, but retains sufficient flexibility through the minority of dynamic
wavelengths that significant wavelength overprovisioning is not necessary.
After describing the system model used in this chapter, we will use the asymp-
totic analysis approach from information theory incorporated in the proof of Shan-
non’s channel capacity theorem [5] to analyze hybrid networks: we allow the
number of users to become large, and consider the minimum provisioning in static
and dynamic wavelengths necessary to achieve non-blocking performance (i.e., to
guarantee that the probability of any call in the snapshot being blocked goes to zero).
We will show that it is always optimal to statically provision enough wavelengths to
support the traffic mean. We also fully characterize the optimal provisioning strategy
for achieving non-blocking performance with minimal wavelength provisioning.
In the shared link context, we can consider each incoming–outgoing pair of fibers
to be a different user of the link. Each lightpath request (which we will henceforth
term a call) can therefore be thought of as belonging to the user corresponding to
the incoming–outgoing fiber pair that it uses. We can similarly associate each static
wavelength with the corresponding user. Under these definitions, a call belonging
to a given user cannot use a static wavelength belonging to a different user – it
must either use a static wavelength belonging to its own user, or employ a dynamic
wavelength.
Figure 3.2 gives a pictorial representation of the decision process for admitting
a call. When a user requests a new call setup, the link checks to see if a static
wavelength for that user is free. If there is a free static wavelength, it is used. If not,
then the link checks to see if any of the shared dynamic wavelengths are free – if so,
then a dynamic wavelength is used. If not, then no resources are available to support
the call, and it is blocked.
Fig. 3.2 Decision process for wavelength assignment for a new call arrival. A new call first tries
to use a static wavelength if it is available. If not, it tries to use a dynamic wavelength. If again
none are available, then it is blocked
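A minimal sketch of this admission decision (with hypothetical state variables of our own; not from the chapter) might look as follows:

# Sketch of the wavelength-assignment decision in Figure 3.2.
# free_static[i] = number of free static wavelengths belonging to user i;
# free_dynamic   = number of free wavelengths in the shared dynamic pool.
def admit_call(user, free_static, free_dynamic):
    """Return how the call is served: 'static', 'dynamic', or 'blocked'."""
    if free_static[user] > 0:          # first try the user's own static pool
        free_static[user] -= 1
        return "static"
    if free_dynamic[0] > 0:            # then try the shared dynamic pool
        free_dynamic[0] -= 1
        return "dynamic"
    return "blocked"                   # no resources left: the call is blocked

free_static = {0: 1, 1: 0}             # example: user 0 has one free static wavelength
free_dynamic = [1]                      # one free dynamic wavelength (mutable holder)
print(admit_call(0, free_static, free_dynamic))   # 'static'
print(admit_call(1, free_static, free_dynamic))   # 'dynamic'
print(admit_call(1, free_static, free_dynamic))   # 'blocked'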
There have been several approaches developed in the literature for blocking prob-
ability analysis of such systems under Poisson traffic models [6], including the
Equivalent Random Traffic (ERT) model [7–9] and the Hayward approximation
[10]. These approaches, while often able to produce good numerical approxima-
tions of blocking probability, are purely numerical in nature and do not provide
good intuition for guiding the dimensioning of the wavelengths.
In this chapter, we adopt a snapshot traffic model that leads to closed-form
asymptotic analysis and develop guidelines for efficient dimensioning of hybrid net-
works. We consider examining a “snapshot” of the traffic demand at some instant
in time. The snapshot is composed of the vector $\mathbf{c} = [c_1, \ldots, c_N]$, where $c_i$ is the number of calls that user $i$ has at the instant of the snapshot, and $N$ is the total number of users.
We model each variable $c_i$ as a Gaussian random variable with mean $\mu_i$ and variance $\sigma_i^2$. This is reasonable since each "user" actually consists of a collection of source–destination pairs in the larger network that all use the link from the same source fiber to the same destination fiber. Initially we will assume that each user has the same mean $\mu$ and variance $\sigma^2$, and later extend the results to general $\mu_i$ and $\sigma_i$. Although the traffic for each individual source–destination pair for the user may have some arbitrary distribution, as long as the distributions are well behaved, the sum of the traffic streams will appear Gaussian by the Central Limit Theorem.
As a special case, consider the common model of Poisson arrivals and exponential holding times for calls. Then the number of calls that would have entered a non-blocking system at any instant in time is given by the stationary distribution of an M/M/∞ queue, namely, Poisson with intensity equal to the offered load in Erlangs. For a heavy load, this distribution is well approximated by a Gaussian random variable with mean and variance both equal to the load.
In this section, we consider a shared link, and assume that there are N users that are
the source of calls on the link. Each user is statically provisioned Ws wavelengths
for use exclusively by that user. In addition to this static provisioning, we will also
provide a total of Wd dynamically switched wavelengths. These wavelengths can be
shared by any of the N users.
As previously described, we will use a snapshot model of traffic. The traffic is given by a vector $\mathbf{c} = [c_1, \ldots, c_N]$, where each $c_i$ is independent and identically distributed as $\mathcal{N}(\mu, \sigma^2)$. We assume that the mean $\mu$ is sufficiently large relative to $\sigma$ that the probability of "negative traffic" (where the realized value of a random variable representing the number of calls is negative, a physical impossibility) is low, and therefore does not present a significant modeling concern. We will primarily be
concerned with a special blocking event that we call overflow. An overflow event
occurs when there are insufficient resources to support all calls in the snapshot and
at least one call is blocked. We will call the probability of this event the overflow
probability.
From Figure 3.2, we see that an overflow event occurs if the total number of calls
exceeds the ability of the static and dynamic wavelengths to support them. This can
be expressed mathematically as
$$\sum_{i=1}^{N} \max\{c_i - W_s, 0\} > W_d, \qquad (3.1)$$
where $\max\{c_i - W_s, 0\}$ is the amount of traffic from each user that exceeds the static provisioning; if the total amount of excess from all users exceeds the available pool of shared dynamic wavelengths, a blocking event occurs.
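As an aside, the overflow condition (3.1) is easy to explore by Monte Carlo simulation; the sketch below (with illustrative parameters of our own, not the chapter's simulation setup) draws Gaussian snapshots and counts how often the shared dynamic pool is exceeded:

# Sketch: Monte Carlo estimate of the overflow probability in Eq. (3.1).
import random

def overflow_probability(N, mu, sigma, Ws, Wd, trials=20000, seed=1):
    rng = random.Random(seed)
    overflows = 0
    for _ in range(trials):
        # one snapshot: N i.i.d. Gaussian user call counts
        excess = sum(max(rng.gauss(mu, sigma) - Ws, 0.0) for _ in range(N))
        if excess > Wd:                       # condition (3.1): shared pool exceeded
            overflows += 1
    return overflows / trials

# Illustrative provisioning: Ws at the mean, Wd proportional to sigma * N.
mu, sigma = 100.0, 10.0
for N in (5, 20, 80):
    print(N, overflow_probability(N, mu, sigma, Ws=mu, Wd=sigma * N))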
If we consider the N -dimensional vector space occupied by c, the constraint
given by (3.1) represents a collection of hyperplanes bounding the admissible traffic
region:
$$c_i \le W_s + W_d,$$
$$c_i + c_j \le 2W_s + W_d, \qquad i \ne j,$$
$$c_i + c_j + c_k \le 3W_s + W_d, \qquad i \ne j \ne k,$$
$$\vdots$$
Each constraint reflects the fact that the sum of the traffic from any subset of users clearly cannot exceed the sum of the static provisioning for those users plus the entire dynamic provisioning available. Note that there are a total of N sets of constraints, where the nth set consists of $C(N, n) = \frac{N!}{(N-n)!\,n!}$ equations, each involving the sum of n elements of the traffic vector $\mathbf{c}$. If the traffic snapshot $\mathbf{c}$ falls within the region defined by the hyperplanes, all calls are admissible; otherwise, an overflow event occurs. The bold lines in Figure 3.3 show the admissible region for N = 2 in two dimensions.
Fig. 3.3 The admissible traffic region, in two dimensions, for N = 2. Three lines form the boundary constraints represented by (3.1). There are two lines each associated with a single element of the call vector c, and one line associated with both elements of c. The traffic sphere must be entirely contained within this admissible region for the link to be asymptotically non-blocking
We will consider the case where the number of users N becomes large, and use the
law of large numbers to help us draw some conclusions. We can rewrite the call
vector in the form
$$\mathbf{c} = \mu\mathbf{1} + \mathbf{c}',$$
where $\mu$ is the (scalar) value of the mean, $\mathbf{1}$ is the length-N all-ones vector, and $\mathbf{c}' \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ is a zero-mean Gaussian random vector with i.i.d. components. Conceptually, we can visualize the random traffic vector as a random vector $\mathbf{c}'$ centered at $\mu\mathbf{1}$. The length of this random vector is given by
$$\|\mathbf{c}'\| = \sqrt{\sum_{i=1}^{N} c_i'^2}.$$
We use an approach very similar to the sphere packing argument used in the proof
of Shannon’s channel capacity theorem in information theory [5]. We will show that
asymptotically as the number of users becomes large, the traffic vector falls onto a
sphere centered at the mean, and the provisioning becomes a problem of choosing
the appropriate number of static and dynamic wavelengths so that this traffic sphere
is completely contained within the admissible region.
From the law of large numbers, we know that
$$\frac{1}{N}\sum_{i=1}^{N} c_i'^2 \to \sigma^2$$
as $N \to \infty$, so the traffic vector concentrates on a sphere of radius $\sigma\sqrt{N}$ centered at $\mu\mathbf{1}$.
Next, we will derive necessary and sufficient conditions for the admissible traffic
region to enclose the traffic sphere. Our goal is to ensure that we provision Ws
and Wd such that the minimum distance from the center of the traffic sphere to
the boundary of the admissible region is at least the radius of the sphere, therefore
ensuring that all the traffic will fall within the admissible region.
Due to the identical distribution of the traffic for each user, the mean point $\mu\mathbf{1}$ will be equidistant from all hyperplanes whose description involves the same number of elements of $\mathbf{c}$. We define a distance function $f(n)$ such that $f(n)$ is the minimum distance from the mean $\mu\mathbf{1}$ to any hyperplane whose description involves n components of $\mathbf{c}$.
Lemma 3.1. The distance function $f(n)$ from the traffic mean to a hyperplane involving n elements of the traffic vector $\mathbf{c}$ is given by
$$f(n) = \sqrt{n}\left(W_s + \frac{W_d}{n} - \mu\right), \qquad n = 1, \ldots, N. \qquad (3.2)$$
Proof. This is essentially a basic geometric exercise. For a fixed n, the hyperplane has a normal vector consisting of n unity entries and N − n zero entries. Since by symmetry the mean of the traffic is equidistant from all hyperplanes with the same number of active constraints, without loss of generality assume that the first n constraints are active. Then the closest point on the hyperplane has the form
$$[\mu + x, \ldots, \mu + x, \mu, \ldots, \mu],$$
where the first n entries are $\mu + x$, and the remainder are $\mu$. The collection of hyperplanar constraints described by (3.1) can then be rewritten in the form
$$\sum_{i=1}^{n} c_i \le nW_s + W_d. \qquad (3.3)$$
The value of x for which this point lies on the hyperplane is obtained when the constraint in (3.3) becomes tight, which requires that
$$\sum_{i=1}^{n} (\mu + x) = nW_s + W_d \;\Rightarrow\; nx = nW_s + W_d - n\mu \;\Rightarrow\; x = W_s + \frac{W_d}{n} - \mu.$$
The distance from the point $[\mu, \ldots, \mu]$ to this point on the hyperplane is
$$\|[\mu + x, \ldots, \mu + x, \mu, \ldots, \mu] - [\mu, \ldots, \mu]\| = \sqrt{n x^2} = \sqrt{n}\,x = \sqrt{n}\left(W_s + \frac{W_d}{n} - \mu\right).$$
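For intuition, here is a tiny numerical sketch of the distance function (3.2), with illustrative parameter values of our own:

# Sketch: distance f(n) from the traffic mean to the n-element hyperplanes, Eq. (3.2).
import math

def f(n, Ws, Wd, mu):
    return math.sqrt(n) * (Ws + Wd / n - mu)

mu, Ws, Wd, N = 100.0, 110.0, 250.0, 50      # illustrative values
n_min = min(range(1, N + 1), key=lambda n: f(n, Ws, Wd, mu))
print("minimizing n:", n_min, " f(n_min) =", round(f(n_min, Ws, Wd, mu), 2))
# With mu < Ws <= mu + Wd, the minimum is near n* = Wd / (Ws - mu) (Regime 2 below).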
We would like to determine the index n such that $f(n)$ is minimized. Unfortunately, this value of n turns out to depend on the choice of provisioning $W_s$. Let us consider the derivative of the distance function, $f'(n)$:
$$f'(n) = \frac{1}{2\sqrt{n}}\left(W_s + \frac{W_d}{n} - \mu\right) + \sqrt{n}\left(-\frac{W_d}{n^2}\right) = \frac{1}{2\sqrt{n}}\left(W_s - \frac{W_d}{n} - \mu\right).$$
Regime 1: If $W_s \le \mu$
In this region, $f'(n) < 0$ for all n. This implies that $f(n)$ is a decreasing function of n, and $F_{\min} = f(N)$, giving a minimum distance of
$$F_{\min} = \sqrt{N}\left(W_s + \frac{W_d}{N} - \mu\right).$$
Regime 2: If $\mu < W_s \le \mu + W_d$
In this region, $f'(n)$ starts out negative and ends up positive over $1 \le n \le N$. This implies that $f(n)$ is convex and has a minimum. Neglecting integrality concerns, this minimum occurs when $f'(n) = 0$, or
$$n^* = \frac{W_d}{W_s - \mu}.$$
Therefore $F_{\min} = f(n^*)$ in this regime. Substituting the appropriate values, it can be shown that the minimum distance is given by
$$F_{\min} = 2\sqrt{W_d(W_s - \mu)}.$$
Regime 3: If $W_s > \mu + W_d$
In this region, $f'(n) > 0$ for all n. This implies that $f(n)$ is an increasing function of n, and $F_{\min} = f(1)$, giving a minimum distance of
$$F_{\min} = W_s + W_d - \mu.$$
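The three regimes can be folded into one small helper (a sketch; the regime boundaries follow the case analysis above):

# Sketch: minimum distance F_min from the traffic mean to the admissible-region
# boundary, following the three regimes derived above.
import math

def f_min(Ws, Wd, mu, N):
    if Ws <= mu:                         # Regime 1: f decreasing, minimum at n = N
        return math.sqrt(N) * (Ws + Wd / N - mu)
    if Ws <= mu + Wd:                    # Regime 2: interior minimum at n* = Wd/(Ws-mu)
        return 2.0 * math.sqrt(Wd * (Ws - mu))
    return Ws + Wd - mu                  # Regime 3: f increasing, minimum at n = 1

# The link is asymptotically non-blocking when F_min >= sigma * sqrt(N).
mu, sigma, N = 100.0, 10.0, 50
print(f_min(Ws=100.0, Wd=sigma * N, mu=mu, N=N) >= sigma * math.sqrt(N))   # True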
In the preceding section, we derived the minimum distance criteria for the hybrid system. Given a fixed number of statically allocated wavelengths $W_s$, we can use the equation $F_{\min} \ge \sigma\sqrt{N}$ to calculate the minimum number of dynamic wavelengths $W_d$ to achieve asymptotically non-overflow performance. We can also draw a few additional conclusions about provisioning hybrid systems.
Theorem 3.1. It is always optimal to statically provision enough wavelengths to support the traffic mean, i.e., $W_s = \mu$, and the total number of wavelengths required by any hybrid system for asymptotically non-overflow performance satisfies $W_{tot} \ge (\mu + \sigma)N$.
Proof. For $W_s \le \mu$, we know from Case 1 above that the minimum distance constraint is
$$F_{\min} = \sqrt{N}\left(W_s + \frac{W_d}{N} - \mu\right) \ge \sigma\sqrt{N}$$
$$\Rightarrow W_s + \frac{W_d}{N} \ge \mu + \sigma$$
$$\Rightarrow W_{tot} = N W_s + W_d \ge (\mu + \sigma)N. \qquad (3.4)$$
We can also consider a system that is fully static, with no dynamic provisioning.
This is the most inflexible wavelength partitioning, and provides us with an upper
bound on the number of wavelengths required by any hybrid system.
Theorem 3.2. For a fully static system with no dynamic provisioning, the minimum number of wavelengths required is given by
$$W_{tot} = (\mu + \sigma)N + (\sqrt{N} - 1)\sigma N.$$
Note that this exceeds the lower bound on the minimum number of wavelengths by $(\sqrt{N} - 1)\sigma N$. We can therefore regard this quantity as the maximum switching gain that we can achieve in the hybrid system. This gain is measured in the maximum number of wavelengths that could be saved if all wavelengths were dynamically switched.
Combining the upper and lower bounds, we can make the following observation:
Corollary: For efficient overflow-free operation, the total number of wavelengths required by any hybrid system is bounded by
$$(\mu + \sigma)N \le W_{tot} \le (\mu + \sigma)N + (\sqrt{N} - 1)\sigma N.$$
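A quick numerical look at these bounds (with the values $\mu = 100$ and $\sigma = 10$ used in the example that follows):

# Sketch: lower and upper bounds on the total number of wavelengths (Corollary above).
import math

mu, sigma = 100.0, 10.0
for N in (10, 50, 200):
    lower = (mu + sigma) * N                                   # fully flexible hybrid bound
    upper = (mu + sigma) * N + (math.sqrt(N) - 1) * sigma * N  # fully static system
    print(N, int(lower), int(upper), "switching gain:", int(upper - lower))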
We examine the following numerical example to illustrate the application of the pro-
visioning results described. Consider a system with some number of users N . Under
the snapshot model each user generates traffic that is Gaussian with mean $\mu = 100$ and standard deviation $\sigma = 10$. We would like to provision the system to be asymptotically non-blocking as N becomes large. This is equivalent to provisioning the system so that the probability of an overflow event goes to zero.
From Theorem 3.1 we know that a minimum of $W_s = \mu$ static wavelengths should always be provisioned. From (3.4), we have
$$W_{tot} = N W_s + W_d \ge (\mu + \sigma)N$$
$$\Rightarrow W_d \ge \mu N + \sigma N - N W_s = \sigma N.$$
Figure 3.4 shows the overflow probability as N increases for a system provisioned
with Ws and Wd wavelengths according to the equations given above as obtained
through simulation. The rapidly descending curve shows that if the theoretical minimum of $W_{tot} = (\mu + \sigma)N$ wavelengths is provisioned with $W_s = \mu$, then as N
increases, the overflow probability drops off quickly and eventually the system be-
comes asymptotically non-blocking. The second curve shows overflow probability
when the pool of dynamic wavelengths has been reduced to bring Wt ot down by
5%. We see that in this case, the overflow probability remains flat and no longer
decreases as a function of the number of users.
Next suppose that we would like to provision additional static wavelengths to re-
duce the number of dynamic wavelengths required. Consider a provisioning scheme
where $W_s = 1.1\mu$. For reasonably large N, this puts us in the region where $\mu < W_s \le \mu + W_d$. In this regime,
Fig. 3.4 Curves show decrease in overflow probability with increasing number of users N . Note
that if significantly fewer than Wtot wavelengths are provisioned, the overflow probability no longer
converges to zero as the number of users increases
$$F_{\min} = 2\sqrt{W_d(W_s - \mu)} \ge \sigma\sqrt{N}$$
$$\Rightarrow 4 W_d (W_s - \mu) \ge N\sigma^2$$
$$\Rightarrow W_d \ge \frac{N\sigma^2}{4(W_s - \mu)} = \frac{N\sigma^2}{0.4\,\mu}.$$
The first curve in Figure 3.5 shows the decrease in the overflow probability when both $W_s$ and $W_d$ are provisioned according to these equations. In the second curve,
both the static and dynamic pools have been reduced in equal proportions such that
the total number of wavelengths has decreased by 5%. We again see that the over-
flow probability no longer decreases as N increases.
Finally, Table 3.1 illustrates the tradeoff between provisioning more wavelengths
statically versus the total number of wavelengths required in this example. We see
that in the minimally statically provisioned case, the total number of wavelengths is
small, at the cost of a large number of dynamic wavelengths. By overprovisioning
the mean statically, as in the second case, the number of dynamic wavelengths can
be significantly reduced, at the cost of increasing the total number of wavelengths.
The optimal tradeoff in a specific case will depend on the relative cost of static
versus dynamic wavelengths.
Fig. 3.5 Curves show decrease in overflow probability with increasing number of users N . Again
note that if fewer than Wtot wavelengths are provisioned, the overflow probability no longer con-
verges to zero as the number of users increases
The majority of this chapter has dealt with the case of independent identically dis-
tributed user traffic: we have assumed that $\mu_i = \mu$ and $\sigma_i^2 = \sigma^2$ for all users i. In
many scenarios this will not be the case. Depending on the applications being served
and usage profiles, users could have traffic demands that differ significantly from
each other. In this section, we discuss how to deal with non-IID traffic scenarios.
We now consider each user i to be characterized by traffic $c_i$, where $c_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$. It now makes sense to allow for a different number of static wavelengths $W_s^{(i)}$ to be provisioned per user. As before, an overflow occurs if
$$\sum_{i=1}^{N} \max\left\{c_i - W_s^{(i)}, 0\right\} > W_d. \qquad (3.5)$$
Writing $c_i = \mu_i + \sigma_i \hat{c}_i$, each $\hat{c}_i$ is now an IID standard Gaussian random variable with mean 0 and variance 1. We can rewrite (3.5) in the form
$$\sum_{i=1}^{N} \max\left\{\sigma_i \hat{c}_i + \mu_i - W_s^{(i)}, 0\right\} > W_d.$$
Again consider the nth set of boundary constraints, and suppose that the first n elements of $\mathbf{c}$ are active. Then we require
$$\sum_{i=1}^{n} \left(\sigma_i \hat{c}_i + \mu_i - W_s^{(i)}\right) \le W_d$$
$$\Rightarrow \sum_{i=1}^{n} \sigma_i \hat{c}_i \le W_d + \sum_{i=1}^{n} \left(W_s^{(i)} - \mu_i\right). \qquad (3.6)$$
Note that the equations in (3.6) again describe sets of hyperplanes that form the admissible region for the traffic vector $\hat{\mathbf{c}} = [\hat{c}_1, \ldots, \hat{c}_N]$. As the number of users becomes large, the traffic vector will concentrate itself on a sphere of radius $\sqrt{N}$ centered at the origin. Therefore, a necessary and sufficient condition for the system to be asymptotically non-blocking is simply for the minimum distance from the origin to each of the hyperplanes to be at least $\sqrt{N}$.
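As a sketch (exhaustive over subsets, so only practical for small numbers of users; the parameters are illustrative), this condition can be checked by computing the distance from the origin to each hyperplane in (3.6):

# Sketch: check the non-IID non-blocking condition by brute force over the
# hyperplanes in Eq. (3.6). The distance from the origin to the hyperplane of a
# subset S of users is (Wd + sum_{i in S}(Ws_i - mu_i)) / sqrt(sum_{i in S} sigma_i^2).
import itertools, math

def min_distance(mu, sigma, Ws, Wd):
    n_users = len(mu)
    dmin = float("inf")
    for k in range(1, n_users + 1):
        for S in itertools.combinations(range(n_users), k):
            slack = Wd + sum(Ws[i] - mu[i] for i in S)
            dmin = min(dmin, slack / math.sqrt(sum(sigma[i] ** 2 for i in S)))
    return dmin

# Two user classes (illustrative): heavier users get proportionally more static wavelengths.
mu    = [100.0, 100.0, 200.0, 200.0]
sigma = [10.0, 10.0, 20.0, 20.0]
Ws    = [110.0, 110.0, 220.0, 220.0]
N = len(mu)
print(min_distance(mu, sigma, Ws, Wd=50.0) >= math.sqrt(N))   # asymptotic condition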
3.3 Conclusion
Acknowledgement This work was supported in part by NSF grants ANI-0073730, ANI-0335217,
and CNS-0626781.
References
Chapter 4
Spectrum and Interference Management
in Next-Generation Wireless Networks
Abstract Further advances in cellular system performance will likely come from
effective interference management techniques. In this chapter we focus on two
emerging interference management methods, namely fractional frequency reuse
(FFR) and Network MIMO, that transcend the limits of existing interference man-
agement techniques in cellular systems. FFR achieves interference avoidance be-
tween sectors through intelligent allocation of power, time and frequency resources
that explicitly takes into account the impact on the neighbor sector performance of
interference that is created as a result of the allocation. Network MIMO is aimed at
interference cancellation through joint coding and signal processing across multiple
base stations. We present an overview of these techniques along with example sim-
ulation results to show the benefits that can be achieved. These techniques vary in
complexity and the magnitude of the performance benefit achieved.
4.1 Introduction
Next-generation cellular systems are currently being developed and will be deployed
in a few years time. These systems target significantly higher aggregate capacities
and higher per-user data rates compared to existing systems [1, 2]. In particular, one
of the goals of these systems is to boost the performance of users at the cell edge that
typically suffer from significant out-of-cell interference. A variety of innovations,
including multiple-input multiple-output (MIMO) multi-antenna techniques and use
of wider signaling bandwidths, are being adopted to achieve the desired level of
performance.
H. Viswanathan ()
Bell Labs, Alcatel Lucent 600-700 Mountain Avenue, Murray Hill, NJ 07974
e-mail: [email protected]
S. Venkatesan
Bell Labs, Alcatel Lucent R-207, 791 Holmdel Road, Holmdel, NJ 07733
e-mail: [email protected]
In this section we provide a brief overview of some of the key building blocks upon
which the interference management techniques discussed in this chapter are built.
OFDMA systems supporting FFR for interference mitigation divide frequency sub-
carriers into several sub-bands or more generally frequency and time resources into
resource sets. Frequency hopping of sub-carriers is restricted to be within the sub-
carriers of a resource set so that users scheduled on a certain resource set experience
interference only from users scheduled in neighboring sectors in the same resource
set. Typically, each resource set is reserved for a certain reuse factor and is asso-
ciated with a particular transmission power profile. For example, suppose we have
three sectors covering a certain area, and there are four resource sets. Then, resource
set 4 can be reserved for “good geometry” users (those close to their base station,
with less interference from other sectors) in all sectors, and resource sets 1, 2, 3,
for “bad geometry” users (further from their base station, more interference from
other sectors) in sectors 1, 2, 3, respectively. As a result, we have 1/3 reuse for bad
geometry users and 1/1 (i.e., universal) reuse for good geometry users. This is an
example of a fractional frequency reuse. The FFR concept is illustrated in Figure 4.1
where example resource allocations for integer reuse and fractional reuse are shown.
Note that FFR can also be “soft” reuse in the sense that although all resource sets
Fig. 4.1 Illustration of integer frequency reuse and fractional frequency reuse resource allocations
in three sector cells
are utilized in all cells, a reuse pattern is created through non-uniform transmission
of power across the different resource sets – most of the power is transmitted on a
subset of the resource sets while a small portion of the power is transmitted on the
remaining resource sets.
Fixed FFR does not adapt to traffic dynamics in the sense that the frequency reuse
achieved is not adjusted based on interference conditions experienced by the users.
Instead of a fixed partition of the available bandwidth leading to fixed frequency
reuse it is possible to achieve dynamic frequency reuse through prioritized use of
sub-carriers in the adjacent sectors. Interference avoidance is achieved by assigning
different priority orders in the neighboring sectors so that when transmitting to cell
edge users using sub-channelization, the neighboring interfering sectors transmit on
different sub-carriers. Such a scheme was described in [4]. For such an approach it
is necessary to do some a priori frequency planning to determine the priorities and
also to dynamically adjust the thresholds under which users are assigned to different
resource sets.
It is also possible for the FFR patterns to be obtained “automatically” without any
prior frequency planning. One such approach is described in [5] where each sector
constantly performs a “selfish” optimization of the assignment of its users to re-
source sets, with the objective of optimizing its own performance. The optimization
is done based on the interference levels reported by users for different resource sets,
and is performed “continuously” via a computationally efficient shadow scheduling
algorithm. This approach is shown to achieve efficient frequency reuse patterns that
dynamically adapt to the traffic distribution.
We present a brief overview of the algorithm from [5]. The reader is referred to [5]
for the details.
Consider one of the cells, which needs to support N constant bit rate (CBR)-
type flows, say VoIP. Similar ideas can be applied to develop algorithms for elastic,
best effort traffic. For each user $i \in I = \{1, \ldots, N\}$, the cell's base station (BS) can choose which sub-band j to assign it to. Given the other-cell interference levels currently observed by user i, the BS "knows" (i.e., can estimate from user feedback) that it would need to allocate $m_{ij}$ sub-carriers and average power $p_{ij}$ if this user is
to be assigned to sub-band j . Since other-cell interference is not constant in time (it
depends on the user-to-sub-band assignments in other cells, and the actual powers
those users require), the values of mij and pij change with time. However, these
“parameters” depend on time-averaged interference levels (over the intervals of the
order of 1 second), and therefore they do not change “fast” (i.e., from slot to slot).
Any user-to-sub-band assignment the cell employs at a given time should be such
that sub-band capacities are not exceeded:
$$\sum_{i \in A(j)} m_{ij} \le c, \qquad \forall j, \qquad (4.1)$$
where $A(j)$ is the set of users assigned to sub-band j, and the total power used in all sub-bands is below the maximum available level $\bar{p}$:
$$\sum_{j} \sum_{i \in A(j)} p_{ij} \le \bar{p}. \qquad (4.2)$$
Subject to these constraints, each cell seeks the assignment that minimizes its total transmit power,
$$\min \sum_{j} \sum_{i \in A(j)} p_{ij}, \qquad (4.3)$$
and a user generally requires less power in some ("good") sub-bands than in others, depending on the interference it sees in the different sub-bands j. Therefore, the cell will allocate larger powers to its good sub-bands,
thus making those sub-bands “bad” for the neighboring cells. Neighboring cells
then (while trying to minimize their own total powers) will “avoid” putting their
edge user into those sub-bands, making them even “better” for the cell under con-
sideration, and so on. It is intuitive that the system “settles” into a user-to-sub-band
allocation pattern, generally requiring less power in all cells, because neighboring
cells will automatically “separate” their edge users into different sub-bands.
How each BS is going to solve problem (4.3), (4.1), (4.2) is of crucial importance, because, to make the approach practical, the following requirements need to be observed:
- The algorithm has to be computationally very efficient.
- It should not result in a large number of users being re-assigned from sub-band to sub-band in each time slot.
- We want an algorithm which adapts automatically to the "evolving" set of users I and their "evolving" parameters $m_{ij}$ and $p_{ij}$.
Such an algorithm, run separately in each cell, can be devised using the greedy primal-dual algorithm described in [8]. We refer the reader to [5] for the details.
We present some simulation results from [5] to illustrate how the algorithm performs in a simple setting. We do not provide all the detailed simulation assumptions here but refer the reader to the original source.
A three-sector network (as shown in Figure 4.3), with the sectors facing each other, is used to study the behavior of the algorithm under a variety of traffic distribution scenarios. The site-to-site distance is set to 2.5 km. Users are distributed in the triangular region covered by the three sectors. Standard propagation parameters
[9] are used to determine the received signal power level for a given transmit power
level. The propagation parameters result in a cell edge SNR (signal to thermal noise
ratio) of 20 dB, when there is no interference from surrounding cells, and the total
available power is distributed over the entire bandwidth.
A constant bit rate traffic model in which, for all users in the active state, fixed
length packets of size 128 bits arrive once every 20 slots is adopted. Users transition
between active and inactive states according to exponentially distributed waiting
times in each state. No packets arrive when a user is in the inactive state. The mean
time in each state is 200 slots.
In the simulations, an OFDMA system with 48 sub-carriers divided into a number
of sub-bands with the same number of sub-carriers in each sub-band is consid-
ered. Typically, we consider three sub-bands with 16 sub-carriers in each sub-band.
Random frequency hopping is implemented from slot to slot by permuting the sub-
carrier indices independently across the different sub-bands and sectors.
A key requirement for the algorithm is the feedback of channel quality in the
form of data rate that can be supported within each of the sub-bands. Average chan-
nel quality is assumed to be fed back every slot in the simulations. Transmission
rates achieved are computed using the Shannon formula and idealized incremental
redundancy is also simulated.
When the activity state for user i changes from inactive to active and the first packet arrives at the sector, $p_{ij}$ and $m_{ij}$ are calculated for each sub-band j to meet the target data rate requirement. For this computation the channel quality indicator fed back by each user on a per sub-band level is used. Among the various combinations of $p_{ij}$ and $m_{ij}$ that result in the same rate, the one that requires the least number of sub-carriers is used. Determining the optimal combination of $p_{ij}$ and $m_{ij}$ is non-trivial.
We compare the performance of the shadow algorithm for three sub-bands to that
of universal reuse with a single sub-band and no interference coordination. Users are
distributed uniformly in all three sectors in all of the results in this section. Compari-
son is performed on the basis of the maximum number of users that can be supported
in each sector. This maximum number is obtained from the cumulative distribution
functions (CDFs) of the total sector transmit power and the mean queue size.
Fig. 4.4 Complementary cumulative distribution functions of the total sector power and mean queue size for 20 dB cell edge SNR (curves are shown for universal reuse with 160 and 170 users and for the shadow algorithm with 210 and 220 users)
Figure 4.4 shows the complementary cumulative distribution functions of the total sector power normalized by the maximum available sector power and the mean
queue size over the duration of the simulation normalized by the packet size for
the different users. The criteria for determining if a given number of users can be
supported by the system are (a) the one-percentile point in the normalized total sector power plot should not exceed 1 for valid system operation and (b) the one-percentile point in the normalized mean queue size plot should not exceed about 2. The latter criterion reflects the fact that, for real-time traffic such as VoIP, an over-the-air one-way delay of up to 80 ms may be tolerated. This translates to a maximum delay of about four packets. Since we are using mean delay, a two-packet delay has been used as the criterion. From these plots we conclude that the gain for the shadow algorithm over universal reuse in this case is about 30%.
In Figure 4.5 we show the total number of users across the three sectors that
require re-assignment to a different band to make the allocation more efficient as a
function of the slot index. This number is important because reassigning a user to
a different band incurs additional signaling overhead. Thus, the smaller the number
of reassignments the better. From the figure it is clear that the number of users
reassigned is a small fraction, less than 3%, of the total number of users.
In Figure 4.6 we show the relative power spectral density across the three sub-
bands. This is calculated as follows. First the average transmit power level in each
sub-band is normalized by the average number of sub-carriers assigned in that sub-
band. The resulting power spectral densities are then normalized by the smallest
of the three power spectral density values to obtain the relative values which are
then plotted in dB.
Fig. 4.5 Total number of band changes across the three sectors vs. slot index
Fig. 4.6 Relative transmit power ratios (in dB) in the different bands in the three sectors
The figure shows that there is a significant difference in the transmit
power spectral densities across the three sub-bands demonstrating that automatic
soft fractional frequency reuse is indeed taking place.
In the case of non-uniform traffic higher gains are achieved with the proposed
algorithm as illustrated in [5].
4.4 Network MIMO
As pointed out in Section 4.1, the spectral efficiency achievable in today's cellular
networks (i.e., the overall system throughput per unit of bandwidth) is fundamen-
tally limited by cochannel interference, i.e., the interference between users sharing
the same time-frequency channel. While intracell interference can be eliminated
by orthogonal multiple access and multiplexing techniques, intercell interference
still remains. Consequently, increasing the signal-to-noise-ratio (SNR) on individual
links does not increase the spectral efficiency of the network appreciably beyond a
point, because the signal-to-interference-plus-noise ratio (SINR) on each link begins
to saturate.
Within the SINR limits imposed by this cochannel interference environment, link
performance is already close to optimal, thanks to the use of sophisticated error cor-
recting codes, adaptive modulation, incremental redundancy, etc. [10]. While the
SINR distribution can be improved (especially in terms of helping cell-edge users)
by means of the fractional frequency reuse techniques described earlier in this chap-
ter, the resulting spectral efficiency gains are typically modest, as seen in the results
of Section 4.3.
It is therefore clear that major improvements in spectral efficiency for future
generations of cellular networks will require more ambitious approaches to mitigate
cochannel interference. “Network MIMO” [11–20] is one such approach, which can
be regarded as an extension to the network level of traditional link-level MIMO con-
cepts. The basic idea behind Network MIMO is to coordinate several base stations
in the transmission and reception of user signals and to suppress interference be-
tween such users by joint beamforming across the antennas of all the base stations
(possibly augmented by non-linear interference cancellation techniques).
Suppose that the network has several “coordination clusters”, each consisting of
a base station and one or more rings of its neighbors, and that the antennas of all the
base stations in each cluster can act as a single coherent antenna array. It is envisaged
that such coordination between base stations in a cluster would be facilitated by
a high-speed, low-latency (wired or wireless) backhaul network. Each user in the
network is served by one such cluster. The interference affecting each user can then
be suppressed quite effectively by means of coherent linear beamforming at the
antennas of all the base stations in its assigned cluster (transmit beamforming on
the downlink, and receive beamforming on the uplink), thereby greatly increasing
the attainable spectral efficiency [11–20].
It should be noted that Network MIMO on the downlink requires downlink
channel state information at the base stations. In a frequency-division duplexing
(FDD) system, this calls for accurate and timely feedback of such information
from the users, at a significant cost in terms of uplink bandwidth. With time-
division duplexing (TDD), however, channel reciprocity can be exploited to obtain
downlink channel state information directly from uplink channel estimates. In con-
trast, Network MIMO on the uplink does not require any over-the-air exchange
of information beyond what is customary, making it potentially amenable to a
standards-free implementation.
Some thought shows that cochannel interference mitigation through Network
MIMO should have a greater impact on spectral efficiency in a higher-SNR envi-
ronment, since the level of cochannel interference relative to receiver noise is then
higher. The SNR distribution in the network is determined by the transmitter power
available to each user, bandwidth of operation, propagation characteristics of the
environment, antenna gains, amplifier noise figures, etc. Also, it is intuitively clear
that, for a given SNR distribution, there is a diminishing benefit in coordinating
farther and farther base stations.
4.4.1 Algorithms
To illustrate Network MIMO concepts and algorithms, we will focus here on the
uplink [18–20]. Following [18], suppose that the network is populated with one
user per base station antenna (i.e., one user per spatial dimension), and that all these
users, but for a small fraction consigned to outage due to unfavorable channel condi-
tions, must be served at a common data rate. Results on power control from [23–25]
can be used to develop algorithms for identifying the subset of users that must be
declared in outage, as well as the powers at which the remaining users must transmit
and the coordination clusters at which they must be received. The largest common
data rate that is consistent with the desired user outage probability can then be de-
termined by simulation, for coordination clusters of different sizes.
Several idealizing assumptions are made in [18] in order to get a sense of the
potential payoff without getting bogged down in details. For example, channel es-
timation issues are ignored by assuming the availability of perfect channel state
information wherever necessary. Also, the bandwidth, latency, and synchronization
requirements on the backhaul network connecting the base stations in each cluster
are not dealt with. All such issues will need to be solved before Network MIMO can
be deployed in real-world networks.
In the interest of simplicity, all user-to-sector links in the network are assumed
flat-fading and time-invariant in [18], with perfect symbol synchronization between
all users at each sector (i.e., the symbol boundaries in all the user signals are as-
sumed to be aligned at each sector). Further, each user in the network has a single
omnidirectional transmitting antenna. Accordingly, the complex baseband signal
vector $\mathbf{y}_s(t) \in \mathbb{C}^N$ received at the N antennas of sector s during symbol period t is modeled as
$$\mathbf{y}_s(t) = \sum_{u=1}^{U} \mathbf{h}_{s,u}\, x_u(t) + \mathbf{z}_s(t). \qquad (4.4)$$
Here, U is the total number of users in the network; $x_u(t) \in \mathbb{C}$ is the complex baseband signal transmitted by user u during symbol period t; $\mathbf{h}_{s,u} \in \mathbb{C}^N$ is the vector representing the channel from user u to sector s; and $\mathbf{z}_s(t) \in \mathbb{C}^N$ is a circularly symmetric complex Gaussian vector representing additive receiver noise, with $E[\mathbf{z}_s(t)] = \mathbf{0}$ and $E[\mathbf{z}_s(t)\mathbf{z}_s(t)^\dagger] = \mathbf{I}$. Each user is subject to a transmitted power constraint of 1, i.e., $E[|x_u(t)|^2] \le 1$.
A coordination cluster is defined to be a subset of base stations that jointly and
coherently process the received signals at the antennas of all their sectors. The
network is postulated to have a predefined set of coordination clusters, and each
user can be assigned to any one of these clusters. Further, each cluster uses a lin-
ear minimum-mean-squared-error (MMSE) beamforming receiver to detect each
user assigned to it, in the presence of interference from all other users in the net-
work (more generally, receivers based on interference cancellation could also be
considered).
Fig. 4.7 Examples of r-ring coordination clusters: (a) r = 0, (b) r = 1, (c) r = 2, (d) r = 4
To highlight the dependence of the spectral efficiency gain on the number of rings
of neighbors with which each base station is coordinated, coordination clusters with
a specific structure are of interest. For any integer $r \ge 0$, an r-ring coordination cluster is defined to consist of any base station and the first r rings of its neighboring base stations (accounting for wraparound), and $C_r$ is defined to be the set of all r-ring coordination clusters in the network. Figure 4.7 illustrates r-ring clusters for r = 0, 1, 2, 4.
With C0 as the set of coordination clusters in the network, there is in fact no
coordination between base stations. This case serves as the benchmark in estimating
the spectral efficiency gain achievable with sets of larger coordination clusters. With
some abuse of notation, let $\mathbf{h}_{C,u} \in \mathbb{C}^{3N|C|}$ denote the channel from user u to the antennas of all the base stations in the coordination cluster C (here $|C|$ denotes the number of base stations in C). Then, with user u transmitting power $p_u$, the SINR attained by user u at cluster C is
$$p_u\, \mathbf{h}_{C,u}^\dagger \left(\mathbf{I} + \sum_{v \ne u} p_v\, \mathbf{h}_{C,v}\mathbf{h}_{C,v}^\dagger\right)^{-1} \mathbf{h}_{C,u}.$$
Note that this expression assumes perfect knowledge at cluster C of the channel vector $\mathbf{h}_{C,u}$ and the composite interference covariance $\sum_{v \ne u} p_v\, \mathbf{h}_{C,v}\mathbf{h}_{C,v}^\dagger$.
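A small numerical sketch of this SINR computation (random i.i.d. channels and unit powers, purely for illustration):

# Sketch: linear-MMSE SINR of user u at a coordination cluster, per the expression above.
import numpy as np

def mmse_sinr(H, p, u):
    """H: (dim x U) matrix whose columns are the channels h_{C,v}; p: transmit powers."""
    dim, U = H.shape
    # interference-plus-noise covariance seen when detecting user u
    Q = np.eye(dim, dtype=complex)
    for v in range(U):
        if v != u:
            Q += p[v] * np.outer(H[:, v], H[:, v].conj())
    h_u = H[:, u]
    return float(np.real(p[u] * h_u.conj() @ np.linalg.solve(Q, h_u)))

rng = np.random.default_rng(0)
dim, U = 6, 4      # e.g., a two-base-station cluster with 3 antennas each (illustrative)
H = (rng.standard_normal((dim, U)) + 1j * rng.standard_normal((dim, U))) / np.sqrt(2)
p = np.ones(U)     # unit transmit powers (power constraint = 1)
print([round(mmse_sinr(H, p, u), 2) for u in range(U)])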
Let the target rate for each user in the network be R bits/sym. Since there are 3N users per cell, the offered load to the network is then 3NR bits/sym/cell. Assuming Gaussian signaling and ideal coding, the target rate of R bits/sym translates to a target SINR of $\gamma \triangleq 2^R - 1$ for each user.
To begin with, suppose that the target SINR $\gamma$ is small enough for all the users to achieve it, given the power constraint on each user and the interference between users. This means that there exists a feasible setting of each user's transmitted power, and an assignment of users to coordination clusters, such that each user attains an SINR of $\gamma$ or higher at its assigned cluster, with an SINR-maximizing linear MMSE receiver. In this situation, the following iterative algorithm from [23] (also see [24, 25]) can be used to determine the transmitted powers and cluster assignments for all the users:
1. Initialize all user powers to 0: $p_u^{(0)} = 0$ for all u.
2. Given user powers $\{p_u^{(n)}\}$, assign each user u to the cluster $C_u^{(n)}$ where it would attain the highest SINR:
$$C_u^{(n)} = \arg\max_{C}\; \mathbf{h}_{C,u}^\dagger \left(\mathbf{Q}_{C,u}^{(n)}\right)^{-1} \mathbf{h}_{C,u}, \qquad \text{where } \mathbf{Q}_{C,u}^{(n)} = \mathbf{I} + \sum_{v \ne u} p_v^{(n)}\, \mathbf{h}_{C,v}\mathbf{h}_{C,v}^\dagger.$$
3. Let $p_u^{(n+1)}$ be the power required by user u to attain the target SINR of $\gamma$ at the cluster $C_u^{(n)}$, assuming every other user v continues to transmit at the current power level:
$$p_u^{(n+1)} = \gamma \left[\mathbf{h}_{C_u^{(n)},u}^\dagger \left(\mathbf{Q}_{C_u^{(n)},u}^{(n)}\right)^{-1} \mathbf{h}_{C_u^{(n)},u}\right]^{-1}. \qquad (4.7)$$
target SINR $\gamma$. For this subset of users, the algorithm then finds the optimal trans-
mitted powers and cluster assignments. However, the user subset itself need not be
the largest possible; essentially, this is because a user consigned to outage in some
iteration cannot be resurrected in a future iteration.
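The iterative power-control-and-assignment step can be sketched as follows (a toy implementation under the same random-channel assumptions as the earlier SINR sketch; convergence and outage handling are simplified):

# Sketch: iterative power control with cluster selection (steps 1-3 above).
# clusters[c] is a channel matrix whose columns are h_{C,v} for candidate cluster c.
import numpy as np

def unit_power_sinr(Hc, p, u):
    dim = Hc.shape[0]
    Q = np.eye(dim, dtype=complex)
    for v in range(Hc.shape[1]):
        if v != u:
            Q += p[v] * np.outer(Hc[:, v], Hc[:, v].conj())
    return float(np.real(Hc[:, u].conj() @ np.linalg.solve(Q, Hc[:, u])))  # SINR per unit power

def iterate_powers(clusters, gamma, n_iters=50, p_max=1.0):
    U = clusters[0].shape[1]
    p = np.zeros(U)                                      # step 1: start from zero power
    assign = np.zeros(U, dtype=int)
    for _ in range(n_iters):
        for u in range(U):
            gains = [unit_power_sinr(Hc, p, u) for Hc in clusters]
            assign[u] = int(np.argmax(gains))            # step 2: best cluster for user u
            p[u] = min(gamma / gains[assign[u]], p_max)  # step 3: power needed to hit gamma
    return p, assign

rng = np.random.default_rng(1)
U, dim = 4, 6
clusters = [(rng.standard_normal((dim, U)) + 1j * rng.standard_normal((dim, U))) / np.sqrt(2)
            for _ in range(2)]                           # two candidate clusters (illustrative)
p, assign = iterate_powers(clusters, gamma=1.0)
print(np.round(p, 3), assign)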
Figures 4.8–4.10 illustrate the spectral efficiency gain achievable with different coordination cluster sizes, for N = 1, N = 2, and N = 4, respectively. Specifically, each figure shows the ratio of the spectral efficiency achievable with $C_1$ (1-ring coordination), $C_2$ (2-ring coordination), and $C_4$ (4-ring coordination) to that achievable with $C_0$ (no coordination), for a different value of N. Note that:
1. The coordination gain increases with the reference SNR in each case, because interference mitigation becomes more helpful as the level of interference between users goes up relative to receiver noise.
2. At the low end of the range, most of the spectral efficiency gain comes just from 1-ring coordination. This is because most of the interferers that are significant relative to receiver noise are within range of the first ring of surrounding base stations. However, as the reference SNR is increased, interferers that are further away start to become significant relative to receiver noise, and therefore it pays to increase the coordination cluster size correspondingly.
3. The coordination gain values are not very sensitive to N, the number of antennas per sector as well as the number of users per sector, suggesting that it is the ratio of users to sector antennas (1 in all our results) that matters.
Fig. 4.8 Coordination gain: 1 antenna/sector, 1 user/sector
Fig. 4.9 Coordination gain: 2 antennas/sector, 2 users/sector
Fig. 4.10 Coordination gain: 4 antennas/sector, 4 users/sector
The results from the simulations indicate that, in a high-SNR environment, the uplink spectral efficiency can potentially be doubled with 1-ring coordination, and nearly quadrupled with 4-ring coordination. When the user-to-sector-antenna ratio is smaller than 1, the coordination gain will be somewhat lower since, even without coordination, each base station can then use the surplus spatial dimensions to suppress a larger portion of the interference affecting each user it serves. The coordination gain with a user-to-sector-antenna ratio larger than 1 will also be lower, because the composite interference affecting each user at any coordination cluster will then tend towards being spatially white, making linear MMSE beamforming less effective at interference suppression.
The FFR algorithm for the CBR case presented here is fairly simple to implement, although additional computational resources would of course be required to perform the necessary computations. One of the challenges common to all interference management techniques is the need for appropriate feedback from the mobile. Because interoperability between mobiles and base stations from different vendors is required, the quantity that is fed back has to be standardized.
The FFR algorithm presented applies only to the CBR case. In practice, a mixture of different traffic types will be involved. A hard separation of resources between CBR and best-effort traffic will be suboptimal. Thus an algorithm that ties together the CBR and best-effort solutions in an adaptive manner is required in practice.
As far as Network MIMO goes, several issues will need to be addressed before
the large spectral efficiency gains hinted at by theoretical results can be realized in
practice. For example, techniques must be developed to estimate the channel from
a user to a faraway base station without excessive overhead for training signals,
especially in a highly mobile environment (data-aided channel estimation methods
could be investigated for this purpose).
Perhaps, most importantly, a high-bandwidth, low-latency backhaul network will
be required for several base stations to jointly process transmitted and received
signals in a timely manner (coherent processing also requires a high degree of syn-
chronization between the base stations). To get a rough idea of the required increase
in backhaul bandwidth, consider the downlink. Without Network MIMO, the data
to be transmitted to a user is routed to a single base station. Now, with Network
MIMO, if the average user is served by B base stations, then the overall backhaul
bandwidth required for user data dissemination will increase by a factor of B (since
the average user’s bits must be routed to B base stations instead of 1). In addition,
the exchange of channel state and beamforming weight information between base
stations (or between base stations and a central Network MIMO processor) will also
require additional backhaul bandwidth, but this will typically be small compared to
what the user data requires.
The costs associated with such a network must be considered in relation to the
savings from the greater efficiency in the use of scarce spectrum. More generally,
Network MIMO must be compared in economic terms with alternative approaches
to increasing spectral efficiency.
4.6 Summary
References
1. 3GPP Specification TR 36.913: Requirements for further advancements for E-UTRA (LTE-
Advanced). Available from http://www.3gpp.org/ftp/Specs/html-info/36913.htm (2008)
2. IEEE 802.16m-07/002r7: IEEE 802.16m System Requirements. Available from http://wirelessman.org/tgm/ (2008)
3. Buddhikot, M.: Understanding dynamic spectrum access: models, taxonomy and challenges.
Proc. IEEE DYSPAN. (March 2007)
4. Das, S., Viswanathan, H.: Interference mitigation through intelligent scheduling. Proc. Asilo-
mar Conference on Signals and Systems. (November 2006)
5. Stolyar, A. L., Viswanathan, H.: Self-organizing Dynamic Fractional Frequency Reuse in
OFDMA Systems. Proc of INFOCOM. (April 2008)
6. Third Generation Partnership Project 2: Ultra Mobile Broadband Technical Specifications.
http://www.3gpp2.org (March 2007)
7. Third Generation Partnership Project: Radio Access Network Work Group 1 Contributions.
http://www.3gpp.org (September 2005)
8. Stolyar, A. L.: Maximizing queueing network utility subject to stability: greedy primal-dual
algorithm. Queueing Systems (2005) 401–457
9. Stolyar, A. L., Viswanathan, H.: Self-organizing dynamic fractional frequency reuse in OFDMA systems. Bell-Labs Alcatel-Lucent Technical Memo (June 2007) http://cm.bell-labs.com/who/stolyar/dffr.pdf
10. Huang, H., Valenzuela, R. A.: Fundamental Simulated Performance of Downlink Fixed Wire-
less Cellular Networks with Multiple Antennas. Proc. IEEE PIMRC (2005) 161–165
11. Shamai, S., Zaidel, B. M.: Enhancing the cellular downlink capacity via co-processing at the
transmitting end. Proc. IEEE Veh. Tech. Conf. (Spring 2001) 1745–1749
12. Karakayali, K., Foschini, G. J., Valenzuela, R. A., Yates, R. D.: On the maximum common rate
achievable in a coordinated network. Proc. IEEE ICC (2006) 4333–4338
13. Karakayali, K., Foschini, G. J., Valenzuela, R. A.: Network coordination for spectrally efficient
communications in cellular systems. IEEE Wireless Commun. Mag. 13:4 (2006) 56-61
14. Foschini, G. J., Karakayali, K., Valenzuela, R. A.: Coordinating multiple antenna cellular net-
works to achieve enormous spectral efficiency. Proc. IEEE 153:4 (2006) 548-555
15. Somekh, O., Simeone, O., Bar-Ness, Y., Haimovich, A. M.: Distributed Multi-Cell Zero-
Forcing Beamforming in Cellular Downlink Channels. Proc. IEEE Globecom (2006) 1–6
16. Jing, S., Tse, D. N. C., Soriaga, J. B., Hou, J., Smee, J. E., Padovani, R.: Downlink Macro-
Diversity in Cellular Networks. Proc. Int’l Symp. Info. Th. (2007)
17. Ng, B. L., Evans, J., Hanly, S.: Distributed Downlink Beamforming in Cellular Networks. Proc.
Int’l Symp. Info. Th. (2007)
18. Venkatesan, S.: Coordinating Base Stations for Greater Uplink Spectral Efficiency in a Cellular
Network. Proc. PIMRC (2007)
19. Venkatesan, S.: Coordinating Base Stations for Greater Uplink Spectral Efficiency: Proportion-
ally Fair User Rates. Proc. PIMRC (2007)
20. Venkatesan, S., Lozano, A., Valenzuela, R. A.: Network MIMO: Overcoming Intercell Inter-
ference in Indoor Wireless Systems. Proc. Asilomar Conf. on Signals, Systems and Computers
(2007)
21. Wolniansky, P. W., Foschini, G. J., Golden, G. D., Valenzuela, R. A.: V-BLAST: an architecture
for realizing very high data rates over the rich-scattering wireless channel. Proc. 1998 URSI
International Symposium on Signals, Systems and Electronics (1998) 295–300
22. Golden, G. D., Foschini, G. J., Valenzuela, R. A., Wolniansky, P. W.: Detection algorithm
and initial laboratory results using V-BLAST space-time communication architecture. IEEE
Electronics Letters 35 (1999) 14–16
23. Rashid-Farrokhi, F., Tassiulas, L., Liu, K. J. R.: Joint optimal power control and beamforming
in wireless networks using antenna arrays. IEEE Trans. Commun. 46 (1998) 1313–1324
24. Hanly, S.: An algorithm for combined cell-site selection and power control to maximize cellular
spread spectrum capacity. IEEE J. Sel. Areas. Commun. 13 (1995) 1332–1340
25. Yates, R., Huang, C. Y.: Integrated power control and base station assignment. IEEE Trans.
Veh. Tech. 44 (1995) 638–644
Chapter 5
Cross-Layer Capacity Estimation
and Throughput Maximization
in Wireless Networks
5.1 Introduction
With rapid advances in wireless radio technology, there has been a significant
growth of interest in various types of large-scale multi-hop wireless networks such
as mesh networks, sensor networks, and mobile ad hoc networks. Effective de-
ployment and operation of these multi-hop wireless networks demands a thorough
grasp of the following fundamental issues: What is the rate at which data can be
transferred across a multi-hop wireless network (this is also known as the through-
put capacity)? How can the network communication protocols be designed in order
to achieve the maximum possible rate? The throughput capacity is a function of
a number of factors including the topology of the network, the traffic pattern, and,
most significantly, the specific communication protocols employed for data transfer.
Network communication protocols are commonly structured in a layered manner –
this is a fundamental architectural principle that has guided the design of protocols
for data networks in general, and the Internet in particular since their inception.
Each layer in the protocol stack performs a collection of related functions, with the
goal of providing a specific set of services to the layer above it and receiving a set
of services from the layer below it. The task of characterizing and maximizing the
throughput capacity of a network involves the joint analysis and optimization of
several layers of the protocol stack.
It is well understood that imposing a strict separation between various layers of
the protocol stack by treating them as isolated units leads to significantly suboptimal performance in wireless networks [16]. As a result, there is considerable emphasis on cross-layer protocol design, which is characterized by
tighter coupling between various layers: information associated with a specific layer
could be exposed across other layers if such sharing could potentially lead to in-
creased efficiency of some network functions. The main goal here is to determine
how best to jointly control various parameters associated with all the layers, and how
to operate the layers in synergy in order to optimize some global network objective
of interest. There are two key factors at work behind the shift towards cross-layer
design in the domain of wireless networking as opposed to the Internet. First, the
complexity of the wireless communication channel, its unreliability, and temporal
changes (due to fading, mobility, multipath, and other wireless propagation effects)
as well as the phenomenon of wireless interference necessitate a holistic, global, optimization-centric view of protocols. Second, the baggage of legacy sys-
tems does not pose as much of an issue in ad hoc wireless networks as it does in the
Internet. This permits a clean-slate approach to network design and control which is
not an option in the present-day Internet.
There has been a considerable amount of theoretical research on cross-layer formulations to characterize the set of traffic rates (also referred to as the rate region) that can be supported by the network, as well as the design of cross-layer protocols motivated by such analysis. This research has witnessed a confluence of diverse perspectives which have their roots in convex optimization theory [22, 23, 25, 44, 80],
queueing theory [61, 62, 64, 74, 76], geometric insights [4, 13, 48–50, 79], graph
theory [68], and systems research [28,29,82] (these references indicate a small sam-
ple; see also the references therein). The potential benefits due to cross-layer design
have been demonstrated in the context of several central network performance mea-
sures including end-to-end throughput related objectives [4, 44, 49, 50, 76, 79, 80],
energy minimization [26, 27, 53], end-to-end latency [19, 41, 48, 70], and network
longevity [59, 77].
scheme has been generalized in several ways [61, 64, 73, 75] and is an essential
component of many other policies that optimize other performance objectives.
While this algorithm is valid in a very general setting, it requires computing a
maximum-weight interference-free set from among a given set of links, which is
NP-hard in general; however, polynomial-time (approximate) extensions of
the algorithm are known for several geometric interference models including the
one described in Section 5.2.
The field of cross-layer design and control is a vast and rapidly expanding re-
search area and we acknowledge our inability to survey all the developments in
this field, let alone dive into them in depth. Our aforementioned choice of the three topics for detailed exploration was guided by the fact that, in addition to influencing much subsequent research in the field of cross-layer design, the specific algorithms we present are also relatively simple to describe and demand little other than a knowledge of basic probability as a prerequisite. We present a brief survey
of several other major developments in the field of cross-layer design in Section 5.6
and conclude in Section 5.7 by presenting a set of open problems for future research.
Several excellent tutorials on cross-layer design are now available which present di-
verse perspectives on the field. Especially notable are the tutorials due to Chiang et
al. [25] and Lin et al. [56] which present an optimization based perspective on cross-
layer network design and architecture, and the tutorial due to Georgiadis et al. [33]
which presents a queueing theoretic perspective on cross-layer resource allocation.
The role of geometric insights in the design of provably good analytical and algorithmic techniques for cross-layer design is an important perspective which has received relatively little attention in the above tutorials. We hope to redress this imbalance through this work.
In this section, we define the basic models and concepts used in this chapter. We consider ad hoc wireless networks. The network is modeled as a directed graph $G = (V, E)$. The nodes of the graph correspond to individual transceivers, and a directed edge (link) $(u, v)$ denotes that node $u$ can transmit to node $v$ directly. A link $(u, v)$ is said to be active at time $t$ if there is an ongoing transmission from $u$ to $v$ at time $t$. Link $e$ in $G = (V, E)$ has a capacity $c(e)$ bits/s, which is the maximum amount of data that can be carried on $e$ in a second. We assume that the system operates synchronously in a time-slotted mode. For simplicity, we will assume that each time slot is 1 s in length. Therefore, if a link $e$ is active in a time slot, then $c(e)$ bits of information can be transmitted over that link during the slot. The results we present are not affected if the length of each slot is an arbitrary quantum of time instead of one unit.
Fig. 5.1 An example illustrating the Tx-model. $D_1$ and $D_2$ are disks centered at $u$ of radii $\mathrm{range}(u)$ and $(1+\Delta)\,\mathrm{range}(u)$, respectively. Similarly, $D_3$ and $D_4$ are disks centered at $w$ of radii $\mathrm{range}(w)$ and $(1+\Delta)\,\mathrm{range}(w)$, respectively. The transmissions $(u, v)$ and $(w, x)$ can be scheduled simultaneously because $||u, w|| \ge (1+\Delta)(\mathrm{range}(u) + \mathrm{range}(w))$ (since the disks $D_2$ and $D_4$ do not intersect)
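As a concrete illustration of the Tx-model condition above, the following minimal sketch (with made-up coordinates, ranges, and $\Delta$ mirroring Fig. 5.1; the function name is ours) tests whether two transmissions may share a slot.

```python
import math

def can_schedule_together(u, w, range_u, range_w, delta):
    """Tx-model test: the transmissions originating at u and w may be scheduled
    simultaneously only if ||u, w|| >= (1 + delta) * (range(u) + range(w))."""
    return math.dist(u, w) >= (1.0 + delta) * (range_u + range_w)

# Hypothetical values mirroring Fig. 5.1: u transmits to v, w transmits to x.
u, w = (0.0, 0.0), (10.0, 0.0)
print(can_schedule_together(u, w, range_u=2.0, range_w=2.0, delta=0.5))  # True: 10 >= 1.5 * 4 = 6
```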
added constraint that transmissions on links that interfere with each other cannot be
scheduled simultaneously. Thus the task of finding optimal multicommodity flows
in wireless networks becomes considerably more complicated. In any such problem
involving dynamic packet injections, a central question is that of stability: a proto-
col is said to be stable if every packet incurs a bounded delay and, consequently, all
buffers have bounded sizes. We seek stable cross-layer protocols that jointly opti-
mize the end-to-end rate allocation, routing, and the scheduling components in order
to maximize the throughput objective of interest to us.
¹ Note that the size of the square does not vary with $n$. The results discussed here will change if the size of the square varies with $n$.
² $E[\cdot]$ denotes expectation over a distribution which is usually clear from the context.
Let $\mathbb{1}(h, b, s)$ be the binary indicator variable which is set to one if the $h$th hop of bit $b$ occurs during slot $s$. Since at most half the network nodes can act as a transmitter during any slot, we have
$$\sum_{b=1}^{\lambda n T} \sum_{h=1}^{h(b)} \mathbb{1}(h, b, s) \;\le\; \frac{W n}{2} \qquad \text{(since $\tau$, the slot length, is 1 s).} \tag{5.2}$$
Summing over all $T$ slots, the total number of hops taken by all the bits satisfies
$$H \;:=\; \sum_{b=1}^{\lambda n T} h(b) \;\le\; \frac{W T n}{2}. \tag{5.3}$$
From the definition of the Tx-model of interference, disks of radius $(1+\Delta)$ times the lengths of hops centered at the transmitters are disjoint. Ignoring edge effects, all these disks are within the unit square. Since at most $W$ bits can be carried in slot $s$ from a transmitter to a receiver, we have
$$\sum_{b=1}^{\lambda n T} \sum_{h=1}^{h(b)} \mathbb{1}(h, b, s)\, \pi (1+\Delta)^2 (r_b^h)^2 \;\le\; W \cdot 1. \tag{5.4}$$
Note that the unit on each side of the above equation is bit-meter². Summing over all the slots gives
$$\sum_{b=1}^{\lambda n T} \sum_{h=1}^{h(b)} \pi (1+\Delta)^2 (r_b^h)^2 \;\le\; W T \tag{5.5}$$
$$\Rightarrow\quad \sum_{b=1}^{\lambda n T} \sum_{h=1}^{h(b)} \frac{1}{H} (r_b^h)^2 \;\le\; \frac{W T}{\pi (1+\Delta)^2 H}. \tag{5.6}$$
Hence, by the concavity of the square-root function,
$$\sum_{b=1}^{\lambda n T} \sum_{h=1}^{h(b)} \frac{1}{H}\, r_b^h \;\le\; \sqrt{\frac{W T}{\pi (1+\Delta)^2 H}}. \tag{5.7}$$
Since the hops of bit $b$ together cover at least the distance from its source to its destination, $\sum_b \sum_h r_b^h \ge \lambda n T \bar{L}$, where $\bar{L}$ denotes the average source–destination distance. Combining this with (5.3) and (5.7) gives
$$\lambda n \bar{L} \;\le\; \frac{1}{\sqrt{2\pi}}\, \frac{1}{(1+\Delta)}\, W \sqrt{n} \ \ \text{bit-meters/second.} \tag{5.9}$$
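As a rough numerical illustration of the transport-capacity bound (5.9) as reconstructed above, the sketch below estimates the mean source–destination distance $\bar{L}$ for random pairs in the unit square and prints the resulting cap on the per-node rate $\lambda$; all parameter values are arbitrary.

```python
import math
import random

def per_node_rate_cap(n, W, delta, trials=100_000, seed=0):
    """Estimate L_bar by sampling random source-destination pairs, then return the cap
    on lambda implied by lambda * n * L_bar <= W * sqrt(n) / ((1 + delta) * sqrt(2 * pi))."""
    rng = random.Random(seed)
    L_bar = sum(math.hypot(rng.random() - rng.random(), rng.random() - rng.random())
                for _ in range(trials)) / trials          # roughly 0.52 for the unit square
    transport_cap = W * math.sqrt(n) / ((1.0 + delta) * math.sqrt(2.0 * math.pi))
    return transport_cap / (n * L_bar)

print(per_node_rate_cap(n=1000, W=1e6, delta=0.5), "bits/s per node")
```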
the source and destination of exactly $k$ packets. The $k$–$k$ permutation routing problem is that of routing all the $k\ell^2$ packets to their destinations. The following result characterizes the achievable performance of $k$–$k$ permutation routing algorithms.

Lemma 5.1. ([40, 52]) $k$–$k$ permutation routing in an $\ell \times \ell$ mesh can be performed deterministically in $\frac{k\ell}{2} + o(k\ell)$ steps with maximum queue size at each processor equal to $k$.
We are ready to describe the scheduling strategy for our random wireless network by reducing it to a permutation routing problem on an $\ell \times \ell$ mesh. We first map the nodes in each specific squarelet in the random wireless network onto a particular processor in the mesh by letting $\ell = \frac{1}{s_n}$. Next, we let each node have $m$ packets and, since a squarelet has a maximum of $c_n$ nodes, we set $k = m c_n$. By fixing the buffer size of each node to be $k = m c_n$, we are ready to map the routing and scheduling algorithm of Lemma 5.1 to the wireless network.
Each processor in the mesh can transmit and receive up to four packets in the
same slot. However, in the wireless network, communication is restricted between
users in neighboring squarelets and only nodes in the same equivalence class of
squarelets can transmit simultaneously during a slot. Further, a user can either trans-
mit or receive at most one packet in each slot. We now serialize the routing protocol
on the mesh network to meet the above constraints. First, we serialize the transmis-
sions of the processors that are not in the same equivalence class. Since there are K 2
equivalence classes in all, this expands the total number of steps in the mesh routing
algorithm by a factor of K 2 . Next, we serialize the transmissions of a single proces-
sor. Since there are no more than four simultaneous transmissions of any processor,
this increases the total number of steps in the mesh routing by a further factor of 4.
Thus, we can map the mesh routing algorithm to the wireless network and conclude that the $m$ packets of each of the $n$ nodes reach their destinations in a number of slots equal to $4K^2 \cdot \frac{k\ell}{2} = \frac{2K^2 m c_n}{s_n}$. This yields the following proposition.
Proposition 5.1. Assuming each squarelet has at least one node, the per-connection throughput for a network with squarelet size $s_n$ and crowding factor $c_n$ is $\Omega\!\left(\frac{s_n}{c_n}\right)$.
We now conclude our proof by showing that, if we set $s_n = \sqrt{\frac{3\log n}{n}}$, then with high probability, no squarelet is empty and $c_n \le 3e\log n$. We begin by proving the non-empty squarelets claim. The probability that any fixed squarelet is empty is equal to $(1 - s_n^2)^n$. From a simple union bound and using the fact that there are $\frac{1}{s_n^2}$ squarelets, we conclude that the probability that at least one squarelet is empty is upper-bounded by $\frac{(1 - s_n^2)^n}{s_n^2}$. Combining this with the fact that $1 - x \le e^{-x}$, and $s_n = \sqrt{\frac{3\log n}{n}}$, we see that the probability of at least one squarelet being empty is at most $\frac{1}{n^2}$, which yields a high probability of $1 - \frac{1}{n^2}$ for the complementary event.
We now prove the bound on the crowding factor. Fix a specific squarelet. The number of nodes in this squarelet is a binomial random variable (say, $Z_n$) with parameters $(s_n^2, n)$. By Chernoff–Hoeffding bounds [24, 38], for any $a > 0$ and $\theta > 0$, we have
$$\Pr[Z_n > a \log n] \;\le\; \frac{E[e^{\theta Z_n}]}{e^{\theta a \log n}},$$
$$E[e^{\theta Z_n}] \;=\; \big(1 + (e^{\theta} - 1)\, s_n^2\big)^n \;\le\; n^{3(e^{\theta} - 1)},$$
$$\Pr[Z_n > 3e \log n] \;\le\; \frac{1}{n^3} \quad (\text{setting } \theta = 1 \text{ and } a = 3e),$$
$$\Pr[c_n > 3e \log n] \;\le\; \frac{1}{n^2} \quad (\text{by a simple union bound}).$$
The above bound on the crowding factor, along with the value of $s_n$ and Proposition 5.1, yields the requisite result.
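The two high-probability claims above are easy to check empirically. The following Monte Carlo sketch (illustrative only; squarelets along the boundary are slightly truncated) places $n$ uniform points, uses $s_n = \sqrt{3\log n / n}$, and reports the number of non-empty squarelets and the crowding factor against the $3e\log n$ bound.

```python
import math
import random
from collections import Counter

def squarelet_stats(n, seed=0):
    rng = random.Random(seed)
    s_n = math.sqrt(3.0 * math.log(n) / n)
    cells = math.ceil(1.0 / s_n)                      # squarelets per side
    counts = Counter()
    for _ in range(n):
        x, y = rng.random(), rng.random()
        counts[(min(int(x / s_n), cells - 1), min(int(y / s_n), cells - 1))] += 1
    return cells * cells, len(counts), max(counts.values()), 3.0 * math.e * math.log(n)

total, non_empty, crowding, bound = squarelet_stats(n=20_000)
print(f"squarelets={total}  non-empty={non_empty}  c_n={crowding}  3e*log(n)={bound:.1f}")
```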
³ Recall that an $\alpha$-approximation algorithm for a maximization problem is one which is always guaranteed to produce a solution whose value is within a factor of $\alpha$ of that of the optimal solution.
approximation algorithm. The basic idea behind the inductive scheduling scheme
is to first perform a total ordering of the links according to a precedence function,
and schedule the links sequentially according to this order. By suitably changing
the precedence function defined on the links, Kumar et al. [49, 50] show how the
inductive scheduling scheme leads to provably good algorithms for a variety of ge-
ometric interference models. We begin by establishing certain necessary conditions
which must be satisfied by the link-rate vector in order for it to be scheduled by any
link-scheduling algorithm. This also highlights the geometric underpinnings of the
inductive scheduling scheme.
Recall that for a link $e = (u, v) \in E$, $I(e)$ denotes the set of links which interfere with $e$. Let $I_{\ge}(e)$ be defined as follows.

Definition 5.1. $I_{\ge}(e) = \{(p, q) : (p, q) \in I(e) \text{ and } ||p, q|| \ge ||u, v||\}$.

$I_{\ge}(e)$ is the subset of links in $I(e)$ which are greater than or equal to $e$ in length. Let $X_{e,t}$ be the indicator variable which is defined as follows:
$$X_{e,t} = \begin{cases} 1 & \text{if } e \text{ transmits successfully at time } t, \\ 0 & \text{otherwise.} \end{cases} \tag{5.10}$$
For any link $e$ and any time slot $t$, we have
$$X_{e,t} + \sum_{f \in I_{\ge}(e)} X_{f,t} \;\le\; \kappa, \tag{5.11}$$
where $\kappa$ is a fixed constant that depends only on the interference model. In particular, for the Tx-model of interference, $\kappa = 5$.
Proof. We first note that in the Tx-model of interference, we may treat interference as occurring between transmitting nodes rather than links, since the interference condition depends solely on the transmission ranges of the transmitters and the distance between them. For any node $u$, define $I(u)$ and $I_{\ge}(u)$ analogously to the definitions for links, as follows: $I(u) = \{w : ||u, w|| < (1+\Delta)\,(\mathrm{range}(u) + \mathrm{range}(w))\}$ and $I_{\ge}(u) = \{w : \mathrm{range}(w) \ge \mathrm{range}(u) \text{ and } w \in I(u)\}$. For any link $e = (u, u')$, $I(e)$ is now defined as follows: $I(e) = \{e' = (w, v) : w \in I(u)\}$. Similarly, $I_{\ge}(e) = \{e' = (w, v) : w \in I_{\ge}(u)\}$. In order to complete the proof of the claim, we only need to show that for any node $u$, at most five nodes in $I_{\ge}(u)$ can simultaneously transmit in any time slot without interfering with each other.

Consider any node $u$ and a large disk centered at $u$ which contains all the nodes in the network. Consider any sector which subtends an angle of $\frac{\pi}{3}$ at $u$. Let $w, w' \in I_{\ge}(u)$ be two nodes in this sector. Without loss of generality, assume that $||u, w'|| \le ||u, w||$. It is easy to see that $||w, w'|| \le ||u, w||$. Further, we have $\mathrm{range}(w') \ge \mathrm{range}(u)$. Thus, $w'$ has a bigger range than $u$ and is closer to $w$ than $u$ is. Since $u$ and $w$ interfere with each other, clearly, $w$ and $w'$ also interfere with each other and hence cannot transmit simultaneously. Thus the angle subtended at $u$ by any two simultaneous transmitters in the set $I_{\ge}(u)$ is strictly greater than $\frac{\pi}{3}$. Hence, there can be at most five successful transmitters from this set, which proves the claim.
Let $\mathbf{f}$ denote a link-flow vector which specifies the rate $f(e)$ that must be supported on each link $e$. Define the utilization of link $e$ to be $x(e) = \frac{f(e)}{c(e)}$: this is the fraction of the time link $e$ is successfully transmitting data in order to meet its rate requirement. Let $\mathbf{x}$ denote the corresponding link-utilization vector whose components are the $x(e)$-values. Taking a time-average of Equation (5.11) yields the following lemma, which imposes a simple necessary condition for link-flow stability.
In this section we present the inductive scheduling algorithm for scheduling a link-
flow vector, whose corresponding link-utilization vector is x. In Section 5.4.1.3, we
analyze conditions under which this algorithm yields a stable schedule (and hence
sufficient conditions for link-flow stability). The algorithm works as follows: time
is divided into uniform and contiguous windows or frames of length w, where w is
a sufficiently large positive integer such that for all $e$, $w \cdot x(e)$ is integral. The algorithm
employs a subroutine called frame-scheduling which specifies a schedule for each
link e within each frame. This schedule is repeated periodically for every frame to
obtain the final schedule. We now present the details of the frame-scheduling algo-
rithm whose pseudo-code is presented in Algorithm 1 (referred to as INDUCTIVE SCHEDULER).
Algorithm 1 INDUCTIVE SCHEDULER
1: for all e ∈ E do
2:    s(e) = ∅
3: Sort E in decreasing order of the lengths of the links.
4: for i = 1 to |E| do
5:    e = E[i]
6:    s'(e) = ⋃_{f ∈ I(e)} s(f)
7:    s(e) = any subset of W \ s'(e) of size w · x(e)
Consider a single frame $W$ whose time slots are numbered $\{1, \ldots, w\}$. For each link $e$, the subroutine assigns a subset of slots $s(e) \subseteq W$ such that the following hold:
1. $|s(e)| = w \cdot x(e)$, i.e., each link receives a fraction $x(e)$ of the time slots.
2. $\forall f \in I(e)$, $s(f) \cap s(e) = \emptyset$, i.e., two links which interfere with each other are not assigned the same time slot.
The pseudo-code for sequential frame-scheduling is provided in Algorithm INDUCTIVE SCHEDULER. For all links $e \in E$, the set $s(e)$ (the set of time slots in $W$ which are currently assigned to $e$) is initialized to $\emptyset$. Links in $E$ are processed sequentially in decreasing order of their lengths. Let the current link being processed be $e$. Let $s'(e)$ denote the set of time slots in $W$ which have already been assigned to links in $I(e)$ (and hence cannot be assigned to $e$): $s'(e) = \bigcup_{f \in I_{\ge}(e)} s(f)$. From the remaining slots $W \setminus s'(e)$, we choose any subset of $w \cdot x(e)$ time slots and assign them to $s(e)$.
The running time of this algorithm depends on $w$, and if the $x(e)$'s are arbitrarily
small, w could become exponential in n in order to satisfy this condition (this still
does not affect the stability of the scheduling algorithm). However, as we discuss in
Section 5.4.1.3, this issue can be addressed through a simple flow scaling technique
which ensures that the link utilizations are not “too small”; this allows us to bound
$w$ by a polynomial in $n$ and consequently the INDUCTIVE SCHEDULER algorithm
runs in polynomial time.
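A minimal Python sketch of the frame-scheduling step of INDUCTIVE SCHEDULER is given below. The graph, link lengths, interference sets, and utilizations are hypothetical inputs; the sketch blocks the slots already taken by interfering links, which, given the decreasing-length processing order, are exactly the links in $I_{\ge}(e)$.

```python
def inductive_frame_schedule(links, length, interferes, x, w):
    """links: link ids; length[e]: link length; interferes[e]: links in I(e);
    x[e]: utilization (w * x[e] integral); w: frame length.
    Returns s[e], the slots in {0, ..., w-1} assigned to each link."""
    s = {e: set() for e in links}
    frame = set(range(w))
    for e in sorted(links, key=lambda e: -length[e]):          # decreasing length
        blocked = set().union(*(s[f] for f in interferes[e]))  # slots of longer interfering links
        free = sorted(frame - blocked)
        need = round(w * x[e])
        assert len(free) >= need, "needs x(e) + sum_{f in I>=(e)} x(f) <= 1"
        s[e] = set(free[:need])
    return s

# Hypothetical 3-link path a-b-c-d in which adjacent links interfere.
links = ["ab", "bc", "cd"]
length = {"ab": 3.0, "bc": 2.0, "cd": 1.0}
interferes = {"ab": {"bc"}, "bc": {"ab", "cd"}, "cd": {"bc"}}
x = {"ab": 0.5, "bc": 0.25, "cd": 0.25}
print(inductive_frame_schedule(links, length, interferes, x, w=4))
```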
which contradicts our assumption. This completes the proof of the lemma. ⊓⊔
Suppose we have a set of end-to-end connections with rates $r_i$ between each $s_i, t_i$ pair. Let $r_i(e)$ denote the amount of connection $i$ flow that is carried by link $e$. For each $e \in E$, let $f(e) = \sum_i r_i(e)$ denote the total flow on link $e$, and let $x(e) = \frac{f(e)}{c(e)}$ denote its utilization. Assuming that the packet injections by the connections into the network are uniform over time, if $\mathbf{x}$ satisfies the conditions of Lemma 5.3, the following result shows that we get a stable schedule, i.e., each packet is delivered in a bounded amount of time to its destination.
Observation 5.1 If the vector $\mathbf{x}$ above satisfies the conditions of Lemma 5.3, each packet is delivered in at most $Wn$ steps.

Proof. Assume that $W$ is such that $\frac{W r_i(e)}{c(e)}$ is integral for each $i$ and $e$. Consider any connection $i$. The number of packets injected for this connection during the window of $W$ is exactly $r_i W$. For each link $e$, partition the $W x(e)$ slots into $\frac{r_i(e)}{c(e)} W$ slots for each connection $i$. Then, clearly, for each connection $i$, each packet can be made to move along one link in $W$ steps. This completes the proof. ⊓⊔
We now combine the ideas presented thus far in this section in the form of the INDUCTIVE LP. The solution to this linear program, along with the inductive scheduling algorithm, yields a provably good solution to the MAXFLOW problem. The INDUCTIVE LP formulation is presented below.

In the formulation, $P_i$ denotes the set of all paths between source $s_i$ and destination $t_i$ of connection $i$. For any $p \in P_i$, $r(p)$ denotes the data rate associated with the path $p$: this is the rate at which data is transferred from $s_i$ to $t_i$ along $p$. Recall that $r_i$ denotes the total rate at which source $s_i$ injects data for destination $t_i$; thus $r_i = \sum_{p \in P_i} r(p)$. For any link $e \in E$, $x(e)$ denotes the total utilization of $e$.
$$\max \sum_{i \in C} w_i r_i \quad \text{subject to}$$
$$\forall i \in C: \quad r_i = \sum_{p \in P_i} r(p),$$
$$\forall i \in C,\ \forall j \in C \setminus \{i\}: \quad r_i \ge r_j,$$
$$\forall e \in E: \quad x(e) = \frac{\sum_{p : e \in p} r(p)}{c(e)},$$
$$\forall e \in E: \quad x(e) + \sum_{f \in I_{\ge}(e)} x(f) \le 1,$$
$$\forall i \in C,\ \forall p \in P_i: \quad r(p) \ge 0.$$
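To make the formulation concrete, the sketch below solves a toy instance of the INDUCTIVE LP with scipy.optimize.linprog. The three-link path network, the single path per connection, and the interference sets are hypothetical, and the equal-rate constraint between connections is omitted for brevity.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance: path u-v-w-x with links e1=(u,v), e2=(v,w), e3=(w,x), all of capacity 1.
# Connection 0 uses the single path [e1, e2, e3]; connection 1 uses [e3].
# Adjacent links interfere; all lengths are equal, so I>=(e) = I(e) in this example.
links = ["e1", "e2", "e3"]
paths = [["e1", "e2", "e3"], ["e3"]]        # one path per connection
I_geq = {"e1": ["e2"], "e2": ["e1", "e3"], "e3": ["e2"]}
weights = np.array([1.0, 1.0])

# x(e) as a linear function of the path-rate vector r (capacities are all 1 here).
util = {e: np.array([1.0 if e in p else 0.0 for p in paths]) for e in links}

# Interference constraints: x(e) + sum_{f in I>=(e)} x(f) <= 1 for every link e.
A_ub = np.array([util[e] + sum(util[f] for f in I_geq[e]) for e in links])
b_ub = np.ones(len(links))

# linprog minimizes, so negate the weighted-throughput objective.
res = linprog(c=-weights, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(paths))
r = res.x
print("path rates:", r, "link utilizations:", {e: float(util[e] @ r) for e in links})
```

With these hypothetical inputs the solver returns $r \approx (0, 1)$: the one-hop connection is granted the full slot budget on e3, since any rate for the three-hop connection would consume three mutually interfering links.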
We make the following observations about the above LP. First, we observe that the size of this program may not be polynomial in the size of the network $G$, as there could be exponentially many paths in $P_i$. However, using standard techniques, the same program could be equivalently stated as a polynomial-size network flow formulation [2]; we choose to present this standard formulation here for ease of exposition. Next, we note that the stability conditions derived in Lemmas 5.2 and 5.3 are crucial for modeling the effect of interference in the LP while still guaranteeing a constant-factor performance ratio. Specifically, in the INDUCTIVE LP, the fourth set of constraints captures wireless interference. These constraints, along with the INDUCTIVE SCHEDULER algorithm, ensure that the data flows computed by the LP
can be feasibly scheduled. Further, the objective value of this LP is at most a con-
stant factor away from that of an optimal solution, for the following reason: suppose the optimal schedule induces a utilization $x^*(e)$ on each link $e$; these utilizations need to satisfy the conditions of Lemma 5.3. Hence, scaling down the optimal end-to-end rates, and hence the optimal link utilizations, by a factor $\kappa$ (the constant which appears in Lemma 5.2) results in a feasible solution to the INDUCTIVE LP. These observations lead to the following theorem.
paths. Let $m' = \max\{k, m\}$. Next, define $(\mathbf{x}', \mathbf{r}')$ where $r'(p) = \max\{r(p), 1/m'^5\}$, i.e., this solution is obtained by rounding up the flow on paths which carry very low flow (i.e., less than $1/m'^5$) to $1/m'^5$. This rate vector could be such that $x'(e) + \sum_{f \in I_{\ge}(e)} x'(f)$ exceeds 1 for some edges $e$. However, we can show that for each $e \in E$, we have $x'(e) + \sum_{f \in I_{\ge}(e)} x'(f) \le 1 + O(1/m')$. This is because there are at most $O(m'^3)$ paths in all, and even if all of these paths are rounded up, they would still carry a total flow of at most $O(1/m'^2)$. This implies that $x'(f) \le x(f) + 1/m'^2$. Therefore $x'(e) + \sum_{f \in I_{\ge}(e)} x'(f) \le x(e) + \sum_{f \in I_{\ge}(e)} x(f) + m' \cdot O(1/m'^2) \le 1 + O(1/m')$. It now follows that the solution $(\mathbf{x}'', \mathbf{r}'')$ defined by $r''(p) = r'(p)/(1 + O(1/m'))$ is feasible, which leads to a total throughput of at least $\sum_i w_i r_i / (1 + O(1/m'))$. Since $x''(e) \ge 1/m'^5$, it follows from our discussion in Section 5.4.1.2 that the algorithm INDUCTIVE SCHEDULER runs in polynomial time.
⁴ The network layer capacity region consists of all connection rate vectors that can be stably scheduled by any routing and scheduling strategy. This notion is made rigorous in Section 5.5.1.
⁵ For ease of exposition, we will not consider the time-varying aspects of the algorithm in this chapter.
description of the network layer capacity region in Section 5.5.1. We then describe
the dynamic back-pressure scheme in Section 5.5.2 and prove its stability property
in Section 5.5.3.
Consider a system comprising a single queue. Let $A(t)$ denote its arrival process: i.e., $A(t)$ denotes the amount of new data entering the queue at slot $t$; we assume that this data arrives at the end of slot $t$ and hence cannot be serviced during that slot. Let $\mu$ denote the transmission rate of the queue (this is the amount of data that the queue can service during each slot). Let $U(t)$ represent the backlog of the queue at the beginning of slot $t$. This process evolves according to the following equation:
$$U(t+1) \;=\; \max[U(t) - \mu,\, 0] + A(t).$$
The queue is said to be stable if
$$\limsup_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} E[U(\tau)] \;<\; \infty.$$
In the setting of our wireless network, let the exogenous arrival process $A_i(t)$ denote the amount of data generated by connection $i$ at the source $s_i$ at slot $t$; in order to analyze network capacity, we assume that the $A_i(t)$ processes satisfy the following properties for admissible inputs.
1. The time-average expected arrival rate of the process is $\lambda$:
$$\lim_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} E[A(\tau)] = \lambda.$$
2. Let $H(t)$ represent the history until time $t$, i.e., all events that take place during slots $\{0, \ldots, t-1\}$. There exists a finite value $A_{\max}$ such that $E[A^2(t) \mid H(t)] \le A_{\max}^2$ for all slots $t$ and all possible events $H(t)$.
3. For any $\delta > 0$, there exists an interval size $T$, possibly dependent on $\delta$, such that for any initial time $t_0$:
$$E\left[\frac{1}{T}\sum_{k=0}^{T-1} A(t_0 + k) \,\Big|\, H(t_0)\right] \le \lambda + \delta.$$
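To illustrate the stability notion for a single queue with admissible arrivals, the following small simulation (arrival distribution, rates, and horizon are arbitrary choices) tracks the empirical time-average backlog under the update $U(t+1) = \max[U(t) - \mu, 0] + A(t)$.

```python
import random

def average_backlog(lam, mu, slots=200_000, seed=1):
    """Empirical time-average backlog; i.i.d. exponential arrivals with mean lam are admissible."""
    rng = random.Random(seed)
    U, total = 0.0, 0.0
    for _ in range(slots):
        total += U
        U = max(U - mu, 0.0) + rng.expovariate(1.0 / lam)
    return total / slots

print(average_backlog(lam=0.8, mu=1.0))   # bounded: the queue is stable
print(average_backlog(lam=1.2, mu=1.0))   # grows with the horizon: unstable
```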
Definition 5.4. The network layer capacity region is the closure of the set of all arrival rate vectors $\langle \lambda_i \rangle$ that can be stably serviced by the network, considering all possible strategies available to the network controller to perform routing and scheduling, including those strategies with perfect knowledge of future arrivals.
In order to construct the network layer capacity region, we first consider the capacity region of a wired network with no interference constraints. A wired network is characterized by a constant matrix $G = \langle G_{(u,v)} \rangle$, where $G_{(u,v)}$ is the fixed rate at which data can be transferred over link $(u, v)$ of the network. The network capacity region in this scenario is described by the set of all arrival rate vectors $(\lambda_i)$ which satisfy the following constraints [2]:
$$\forall i: \quad \lambda_i = \sum_{p \in P_i} r(p),$$
$$\forall e \in E: \quad y(e) = \sum_{p : e \in p} r(p),$$
$$\forall e \in E: \quad y(e) \le G_e,$$
$$\forall i,\ \forall p \in P_i: \quad r(p) \ge 0.$$
In the above linear program, $P_i$ denotes the set of all paths between $s_i$ and $t_i$ in the network, and $r(p)$, where $p \in P_i$, denotes the amount of traffic carried by path $p$ for connection $i$. In the language of network flows, the above program merely states that there is a feasible multicommodity flow to support the connection rate vector $\langle \lambda_i \rangle$. The crucial distinction between a wired network and a wireless network with interference constraints is that the achievable link rates during each time slot in a wireless network are not fixed, but depend on the network controller's actions, which decide the set of links to be activated during each slot. Thus, instead of characterizing the wireless network using a single link rate matrix $G$, we use a collection $\Gamma$ of link-rate matrices. The set $\Gamma$ can be thought of as the set of all long-term link transmission rate matrices $G$ that can be achieved by the network through a suitable choice of the network control policy. $\Gamma$ can be characterized as follows.

Let $I \subseteq E$ denote a subset of links in the network. We will call $I$ a conflict-free link set if no link within $I$ interferes with any other link in $I$. Let $\mathcal{I}$ denote the set of all conflict-free link sets within the network. Suppose the network controller chooses to activate the set of links $I \in \mathcal{I}$ during a particular slot; then, the
link transmission rates during this slot are given by the vector $\mu(I)$, each of whose components corresponds to a specific link in the network; further, the component for link $e$ is $\mu_e(I) = 0$ if $e \notin I$ and $\mu_e(I) = c(e)$ if $e \in I$. The set $\Gamma$ can now be described as follows:
$$\Gamma \;:=\; \mathrm{Conv}\big(\{\mu(I) \mid I \in \mathcal{I}\}\big)$$
where $\mathrm{Conv}(\mathcal{A})$ denotes the convex hull of set $\mathcal{A}$. Specifically, $\mathrm{Conv}(\mathcal{A})$ is defined as the set of all convex combinations $\sum_i p_i a_i$ of elements $a_i \in \mathcal{A}$, where the coefficients $p_i$ are non-negative and add up to one.

Intuitively, we may think of $\Gamma$ as the set of all long-term link transmission rate matrices $G$ which are achievable through a randomized control policy, and vice versa. Specifically, consider any matrix $G$ such that $G = \sum_j p_j\, \mu(I_j)$, where the $I_j$'s belong to $\mathcal{I}$, the $p_j$'s are non-negative, and $\sum_j p_j = 1$. The randomized control policy which selects a single conflict-free link set $I_j$ during each slot at random, with probability equal to $p_j$, achieves the rate matrix $G$ in expectation.
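The characterization of $\Gamma$ is easy to mimic in code. The sketch below enumerates the conflict-free link sets $\mathcal{I}$ of a small hypothetical conflict relation and assembles one long-term rate matrix $G \in \Gamma$ as a convex combination of the vectors $\mu(I)$.

```python
from itertools import combinations

links = ["e1", "e2", "e3"]
cap = {"e1": 1.0, "e2": 1.0, "e3": 2.0}
conflicts = {("e1", "e2"), ("e2", "e3")}          # e2 interferes with both neighbours

def conflict_free(subset):
    return all((a, b) not in conflicts and (b, a) not in conflicts
               for a, b in combinations(subset, 2))

def mu(I):
    """Rate vector mu(I): c(e) for activated links, 0 otherwise."""
    return {e: (cap[e] if e in I else 0.0) for e in links}

# All conflict-free link sets I.
cf_sets = [set(s) for r in range(len(links) + 1) for s in combinations(links, r)
           if conflict_free(s)]

# One point of Gamma: activate {e1, e3} half the time and {e2} half the time.
p = {frozenset({"e1", "e3"}): 0.5, frozenset({"e2"}): 0.5}
G = {e: sum(prob * mu(I)[e] for I, prob in p.items()) for e in links}
print([sorted(I) for I in cf_sets])
print(G)   # {'e1': 0.5, 'e2': 0.5, 'e3': 1.0}
```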
The above discussions lead to the following characterization of the network layer capacity region, which is proved formally in [61, 64]. Let $\mu^i_{(u,v)}(t)$ denote the data rate allocated to commodity $i$ during slot $t$ across the link $(u, v)$ by the network controller.
Theorem 5.2. The connection rate vector $\langle \lambda_i \rangle$ is within the network layer capacity region if and only if there exists a randomized network control algorithm that makes valid $\mu^i_{(u,v)}(t)$ decisions and yields
$$\forall i: \quad E\bigg[\sum_{v : (s_i, v) \in E} \mu^i_{(s_i, v)}(t) \;-\; \sum_{u : (u, s_i) \in E} \mu^i_{(u, s_i)}(t)\bigg] = \lambda_i,$$
$$\forall i,\ \forall w \notin \{s_i, t_i\}: \quad E\bigg[\sum_{v : (w, v) \in E} \mu^i_{(w, v)}(t) \;-\; \sum_{u : (u, w) \in E} \mu^i_{(u, w)}(t)\bigg] = 0.$$
Observe that although the exogenous arrival processes are assumed to be ad-
missible in Theorem 5.2, the capacity region captures all possible network control
strategies, including those that result in non-admissible arrival processes at the indi-
vidual network queues.
We now present the dynamic back-pressure algorithm. At every slot $t$, the network controller observes the queue backlog matrix $U(t) = \langle U^i_v(t) \rangle$, where $U^i_v(t)$ denotes the amount of connection $i$ data backlogged at node $v$, and performs the following action. For each link $(v, w)$, let $i_{(v,w)}(t)$ denote the connection which maximizes the differential backlog (with ties broken arbitrarily): i.e., $i_{(v,w)}(t) = \arg\max_i \{U^i_v(t) - U^i_w(t)\}$. Let $W_{(v,w)}(t) = U^{i_{(v,w)}(t)}_v(t) - U^{i_{(v,w)}(t)}_w(t)$ denote the corresponding maximum differential backlog over link $(v, w)$.
Further, we have
$$\sum_i \sum_v U^i_v(t)\Big[\sum_w \mu^i_{(v,w)}(t) - \sum_u \mu^i_{(u,v)}(t)\Big] \;=\; \sum_{(u,v)} \sum_i \mu^i_{(u,v)}(t)\,\big[U^i_u(t) - U^i_v(t)\big].$$
Combining the simple identity above with Equation (5.14) yields the following important property of the back-pressure algorithm:
$$\sum_v \sum_i U^i_v(t)\Big[\sum_w \tilde{\mu}^i_{(v,w)}(t) - \sum_u \tilde{\mu}^i_{(u,v)}(t)\Big] \;\le\; \sum_v \sum_i U^i_v(t)\Big[\sum_w \mu^i_{(v,w)}(t) - \sum_u \mu^i_{(u,v)}(t)\Big] \tag{5.15}$$
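A minimal sketch of one slot of a back-pressure controller of this kind appears below. It computes the per-link differential backlogs $W_{(v,w)}(t)$ as above and then activates the conflict-free link set maximizing the total backlog-weighted capacity, following the standard max-weight rule of [76]; the exact rate assignment, the example network, backlogs, and conflict-free sets are our assumptions.

```python
def backpressure_slot(backlog, links, cap, conflict_free_sets, commodities):
    """backlog[(node, i)]: U_v^i(t).  For each link pick the commodity with the largest
    differential backlog, then activate the conflict-free set maximizing total weight."""
    best_comm, weight = {}, {}
    for (v, w) in links:
        diffs = {i: backlog[(v, i)] - backlog[(w, i)] for i in commodities}
        best_comm[(v, w)] = max(diffs, key=diffs.get)
        weight[(v, w)] = max(diffs[best_comm[(v, w)]], 0.0)   # never ship on negative pressure
    chosen = max(conflict_free_sets,
                 key=lambda I: sum(cap[e] * weight[e] for e in I))
    return chosen, {e: best_comm[e] for e in chosen}

# Hypothetical 3-node line a-b-c, one commodity 0 destined to c, adjacent links conflict.
links = [("a", "b"), ("b", "c")]
cap = {("a", "b"): 1.0, ("b", "c"): 1.0}
conflict_free_sets = [set(), {("a", "b")}, {("b", "c")}]
backlog = {("a", 0): 5.0, ("b", 0): 1.0, ("c", 0): 0.0}
print(backpressure_slot(backlog, links, cap, conflict_free_sets, [0]))
```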
5.5.3 Analysis
We begin our analysis of the back-pressure algorithm with a brief introduction to the
Lyapunov function framework. The Lyapunov function framework is an important
queuing theoretic tool for proving stability results for networks and for designing
stable network control algorithms [8,58,76]. The essential idea in this framework is
to define the Lyapunov function, a non-negative function that is an aggregate mea-
sure of the lengths of all the queues within the system. The resource allocation
choices made by the network controller are evaluated in terms of how they affect
the Lyapunov function over time. The specific function which we will use in our
analysis is defined as follows. Recall that U.t/ denotes the queue backlog matrix
whose rows correspond to network nodes, columns correspond to connections, and
the .u; i /th entry in the matrix corresponds to the backlog at node u for connection i .
We define the Lyapunov function $L(U(t))$ as follows:
$$L(U(t)) \;=\; \sum_i \sum_v \big(U^i_v(t)\big)^2. \tag{5.16}$$
Note that $L(U(t)) = 0$ only when all the queue backlogs are zero, and $L(U(t))$ is large when one or more components in the backlog matrix is large. The following theorem, which is proved in [33], holds for the specific Lyapunov function described above. We note that claims similar to the following theorem can be made for a broader class of Lyapunov functions as well. Intuitively, the theorem ensures that the Lyapunov function has a negative expected drift whenever the sum of queue backlogs is sufficiently large.
Theorem 5.3. If there exist constants $B > 0$ and $\epsilon > 0$ such that for all slots $t$:
$$E[L(U(t+1)) - L(U(t)) \mid U(t)] \;\le\; B - \epsilon \sum_v \sum_i U^i_v(t), \tag{5.17}$$
then the network is stable.
For a single queue with current backlog $U$, service rate $\mu$, arrivals $A$, and next-slot backlog $V = \max[U - \mu, 0] + A$, we have
$$V^2 \;\le\; U^2 + \mu^2 + A^2 - 2U(\mu - A). \tag{5.18}$$
Summing the above over all indices $(v, i)$ (say, $N$ in number) and combining it with the fact that the sum of squares of non-negative real numbers is less than or equal to the square of the sum, it follows that
$$L(U(t+1)) - L(U(t)) \;\le\; 2BN - 2\sum_v \sum_i U^i_v(t)\Big(\sum_w \mu^i_{(v,w)}(t) - A^i_v(t) - \sum_u \mu^i_{(u,v)}(t)\Big)$$
where $B := \frac{1}{2N}\sum_v \big[(\max_w \mu_{(v,w)})^2 + (\max_i A^i + \max_u \mu_{(u,v)})^2\big]$. Taking conditional expectation yields the following:
$$E[L(U(t+1)) - L(U(t)) \mid U(t)] \;\le\; 2BN + 2\sum_i U^i_{s_i}(t)\, E[A^i_{s_i}(t) \mid U(t)]$$
$$\qquad\qquad -\; 2E\Big[\sum_v \sum_i U^i_v(t)\Big(\sum_w \mu^i_{(v,w)}(t) - \sum_u \mu^i_{(u,v)}(t)\Big) \,\Big|\, U(t)\Big] \tag{5.19}$$
Since the arrivals are i.i.d. over slots, we have $E[A^i_{s_i}(t) \mid U(t)] = \lambda_i$ for all commodities $i$. Hence, we can rewrite Equation (5.19) as
$$E[L(U(t+1)) - L(U(t)) \mid U(t)] \;\le\; 2BN + 2\sum_i U^i_{s_i}(t)\,\lambda_i$$
$$\qquad\qquad -\; 2E\Big[\sum_v \sum_i U^i_v(t)\Big(\sum_w \mu^i_{(v,w)}(t) - \sum_u \mu^i_{(u,v)}(t)\Big) \,\Big|\, U(t)\Big] \tag{5.20}$$
Combining Eqn. (5.21) with Theorem 5.3 proves the stability of the back-pressure algorithm. ⊓⊔
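The per-queue inequality (5.18) that drives this drift computation can be sanity-checked numerically. The sketch assumes $V = \max[U - \mu, 0] + A$, the queue-update form used earlier in this chapter, and draws random non-negative values.

```python
import random

rng = random.Random(0)
for _ in range(100_000):
    U, mu, A = (rng.uniform(0, 10) for _ in range(3))
    V = max(U - mu, 0.0) + A
    # Inequality (5.18): V^2 <= U^2 + mu^2 + A^2 - 2*U*(mu - A)
    assert V * V <= U * U + mu * mu + A * A - 2.0 * U * (mu - A) + 1e-9
print("inequality (5.18) held on all samples")
```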
We now discuss some of the main threads of research in the area of cross-layer
capacity estimation and throughput maximization in wireless networks.
Over the last few years, the capacity of random wireless ad hoc networks has been a
subject of active research. A key result in this area is that of Gupta and Kumar [36],
who studied how the total throughput capacity of an ad hoc wireless network formed
by $n$ nodes distributed randomly in the unit square scales with $n$ – as discussed in Section 5.3, this quantity scales as $\sqrt{n/\log n}$ in the protocol model of interference.
An important consequence of this result is that the throughput capacity of a wireless
network does not scale linearly with the system size unlike wireline networks, where
the capacity can be increased by adding more network elements. We discussed this
result in detail in Section 5.3 of this chapter. This result has been extended in a
number of directions, building on the techniques introduced in [36].
For the physical (or SINR) model of interference, Agarwal and Kumar [1] show that the throughput capacity is $\Theta(\sqrt{n})$. Several papers have shown that the results
of [36] can be obtained by significantly simpler techniques [47, 66, 83] – in par-
ticular, the upper bound of [36] on the throughput capacity can be obtained much
more easily by just examining the minimum multicut separating the connection end-
points. The capacity of networks with directional antennas has been explored in
[66, 83]. These results have shown that directional antennas increase the capacity relative to omnidirectional antennas roughly by a factor of $\frac{1}{\sqrt{\alpha\beta}}$, where $\alpha$ and $\beta$ are the
beam widths for the transmitter and receiver, respectively; therefore, the throughput
capacity can increase significantly only when the beam widths become very small,
at which point the network starts behaving like a wired network.
A natural way to augment the capacity of wireless networks is to consider hybrid
networks containing base stations connected by high capacity wired links. Liu et al.
[57] show that in this model, there is a significant improvement in the capacity only if $\Omega(\sqrt{n})$ hybrid nodes are added. See [1, 34, 45] for further work in this direction.
When nodes are mobile, the estimation of the capacity becomes even more
non-trivial. In a surprising result, Grossglauser and Tse [35] show that in fact the
throughput capacity can be increased significantly when nodes are mobile by a novel
packet relaying scheme; see [15, 31, 34, 60, 78] on other results in this direction.
A central issue in the research on characterizing the rate region is that of stability.
The techniques for proving stability are closely related to the nature of arrival pro-
cesses, and this gives a natural way of classifying the literature on this area. Three
broad classes of arrival processes have been studied – constant bit rate, adversar-
ial processes, and admissible stochastic arrival processes (as discussed in Section
5.5). The first two processes have been usually studied using multicommodity flow
techniques, while control theoretic techniques have been used for the third class of
processes.
There has been much research on determining the optimal rates to maximize
throughput via linear programming (LP) formulations, e.g., [4, 12, 18, 37, 39]. The
first attempts can be traced back to Hajek and Sasaki [37], and to Baker et al.
[12]. Jain et al. [39] propose LP-formulations for max-flow and related problems
in a wireless network; in fact, they formulate their constraints in terms of arbi-
trary conflict graphs which can incorporate any interference model. Kodialam and
Nandagopal [42] propose similar LP-formulations and a scheduling algorithm for
determining the maximum transmission rates for a given network with primary
interference models. This is extended to incorporate secondary interference and
non-uniform power levels in [49], as described in Section 5.4. Buraagohain et al.
[18] show that for the Tx-Rx model, the bounds of [49] can be improved by a fac-
tor of 2 by a more careful ordering of the edges during scheduling. Alicherry et al.
[4] extend these techniques to the case of multiradio wireless networks. Chafekar
et al. [19] extend the inductive scheduling technique to the case where interfer-
ence is modeled by SINR constraints, and obtain a logarithmic approximation to the
throughput capacity in this model. There is also interest in estimating the capacity
for other protocol models, and Chafekar et al. [20] study this question for the ran-
dom access model.
The Lyapunov drift technique discussed in the analysis of the Dynamic Back-
pressure algorithm in Section 5.5.2 is one of the most commonly employed tech-
niques for proving stability. There is a large body of work that extends [76] – Lin and
Shroff [55] show that under primary interference constraints, this approach guaran-
tees 50% of the throughput capacity; Chaporkar et al. [21] extend this to the case
of secondary interference. Zussman et al. [84] identify a large class of graphs that
satisfy “Local Pooling” conditions, for which the algorithm of [76] achieves the
optimal throughput.
The back-pressure technique of [76] combines routing and scheduling decisions
at the packet level. There has been a lot of work on extending this to layered cross-
layer aware protocols, along with other objectives such as fairness or general utility
maximization (see, e.g., [22, 23, 25, 54, 59, 81]) – these papers use convex optimization techniques and interpret Lagrangian multipliers as layer-specific quantities. In
addition to the Lyapunov analysis technique, methods based on fluid models (e.g.,
[30, 72]) have also been used for stability analysis in wireless networks.
Simple stochastic arrival processes are not adequate for some applications involving
very bursty traffic. Borodin et al. [17] develop a new model of arrival processes
called an Adversarial Queuing Model. Here, it is convenient to think of the packet
arrivals as being determined by an adversary. Let $c_e(t)$ denote the capacity of edge $e$ at time slot $t$; $c_e(t)$ can either vary according to a stochastic model, or be chosen by the adversary. The adversary determines the time at which a packet $p$ is injected, its size $\ell_p$, its source $s_p$, and its destination $d_p$. Let $I[t, t']$ denote the set of packets injected by the adversary during the time slots $t, \ldots, t'$. Clearly, an unrestricted adversary
can completely flood the network, making it impossible to achieve stability. This
motivates the following notion of a restricted adversary.
Definition 5.5. ([3, 6]) An adversary injecting packets is said to be an $A(w, \varepsilon)$ adversary, for some $\varepsilon \ge 0$ and some integer $w \ge 1$, if the following condition holds for every time step $t$: the adversary can associate a simple path from $s_p$ to $d_p$ with each packet $p \in I[t, t+w-1]$ so that, for each edge $e \in E$,
$$\sum_{p \in I[t,\, t+w-1]\,:\; e \in p} \ell_p \;\le\; (1 - \varepsilon) \sum_{t'=t}^{t+w-1} c_e(t').$$
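A small helper that checks the $A(w, \varepsilon)$ condition of Definition 5.5 for given injections, assigned paths, and (possibly time-varying) edge capacities is sketched below; the data layout and the example values are hypothetical.

```python
def is_w_eps_feasible(packets, capacity, edges, w, eps, horizon):
    """packets: list of (inject_time, size, path) with path a list of edges.
    capacity[e][t]: capacity of edge e at slot t.  Checks, for every edge e and every
    window [t, t+w-1], that the injected load routed over e is at most (1 - eps) times
    the total capacity of e over the window."""
    for t in range(horizon - w + 1):
        for e in edges:
            load = sum(size for (it, size, path) in packets
                       if t <= it <= t + w - 1 and e in path)
            cap = sum(capacity[e][tp] for tp in range(t, t + w))
            if load > (1.0 - eps) * cap:
                return False
    return True

# Hypothetical example: two edges with constant unit capacities over 6 slots.
edges = ["e1", "e2"]
capacity = {e: [1.0] * 6 for e in edges}
packets = [(0, 1.0, ["e1", "e2"]), (1, 1.0, ["e2"])]
print(is_w_eps_feasible(packets, capacity, edges, w=3, eps=0.1, horizon=6))
```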
With the exception of [6], all other work on adversarial queueing models [3,5,7,9,
17] has focused on static networks, in which ce .t/ is not a function of time. Andrews
et al. [5] showed that several well-known simple greedy queuing protocols, such as
Farthest-to-Go (FTG), Longest-in-System (LIS), and Shortest-in-System (SIS), are
stable, but can have exponentially sized queues; they develop a randomized pro-
tocol with polynomially bounded queues. Aiello et al. [3] show that the dynamic
back-pressure algorithm discussed in Section 5.5.2 is stable. In fact, this algorithm
was developed independently by Awerbuch and Leighton [10, 11] for computing
multicommodity flows around the same time as Tassiulas and Ephremides [76].
Andrews et al. [6] show that in fact the same algorithm, which they refer to as
MAX-WEIGHT, is stable even for the adversarially controlled traffic and network
model of Definition 5.5.
Despite the enormous strides in the throughput capacity problem under several
general settings, many significant open questions remain. Most of the results
yielding polynomial time algorithms for approximating the throughput capacity
in arbitrary networks assume simplified conflict-graph based interference mod-
els, and extending this to more realistic interference models, e.g., those based on
SINR/generalized-SINR constraints, is an important class of problems. It would
seem that a first step towards this goal would be to extend the results of Section
5.4 to stochastic and adversarial packet arrival processes. SINR based models do
not have the feature of simple spatial decomposition, making them much harder to
analyze.
An important direction of future research in throughput capacity estimation is to
predict the capacity of a realistic wireless network in which communication proto-
cols have been plugged into certain layers of the protocol stack and other layers
are open to design. This would require extending the current results to the set-
ting where common routing and scheduling protocols are abstracted in a realistic
manner. One step in this direction is the work on estimating the network capac-
ity under random access MAC protocols, e.g., [20, 80]. Further, there is limited
work on combining power control along with aspects of the other layers, because
these formulations become non-convex. Developing techniques to jointly optimize
the overall throughput capacity, along with the total power usage, is an interesting
direction.
Finally, an important step in translating the algorithmic research on achieving
the rate region into practical protocols would be to develop realistic distributed al-
gorithms. Most current cross-layer algorithms which have provably good guarantees
are either centralized, or require a significant amount of information sharing, making
them expensive to implement in a distributed manner. Further, existing distributed
algorithms do not suitably exploit the broadcast nature of the wireless medium, and do not provide useful non-trivial guarantees on their time and message complexity, which leaves tremendous scope for future investigation.
Acknowledgements V.S. Anil Kumar and Madhav Marathe are partially supported by the follow-
ing grants from NSF: CNS-0626964, SES-0729441, CNS-0831633, and CNS CAREER 0845700.
References
1. A. Agarwal and P. Kumar. Capacity bounds for ad hoc and hybrid wireless networks. In ACM
SIGCOMM Computer Communications Review, volume 34(3), 2004.
2. R. Ahuja, R. Magnanti, and J. Orlin. Network Flows: Theory, Algorithms, and Applications.
Prentice Hall, 1993.
3. W. Aiello, E. Kushilevitz, R. Ostrovsky, and A. Rosén. Adaptive packet routing for bursty
adversarial traffic. J. Comput. Syst. Sci., 60(3):482–509, 2000.
4. M. Alicherry, R. Bhatia, and L. E. Li. Joint channel assignment and routing for throughput
optimization in multi-radio wireless mesh networks. In MobiCom ’05: Proceedings of the
11th annual international conference on Mobile computing and networking, pages 58–72, New
York, NY, USA, 2005. ACM Press.
5. M. Andrews, B. Awerbuch, A. Fernández, F. T. Leighton, Z. Liu, and J. M. Kleinberg.
Universal-stability results and performance bounds for greedy contention-resolution protocols.
J. ACM, 48(1):39–69, 2001.
6. M. Andrews, K. Jung, and A. Stolyar. Stability of the max-weight routing and scheduling
protocol in dynamic networks and at critical loads. In STOC, pages 145–154, 2007.
7. E. Anshelevich, D. Kempe, and J. M. Kleinberg. Stability of load balancing algorithms in
dynamic adversarial systems. In STOC, pages 399–406, 2002.
8. S. Asmussen. Applied Probability and Queues. New York: Springer-Verlag, 2 edition, 1993.
9. B. Awerbuch, P. Berenbrink, A. Brinkmann, and C. Scheideler. Simple routing strategies for
adversarial systems. In FOCS, pages 158–167, 2001.
10. B. Awerbuch and F. T. Leighton. A simple local-control approximation algorithm for multi-
commodity flow. In FOCS, pages 459–468, 1993.
11. B. Awerbuch and T. Leighton. Improved approximation algorithms for the multi-commodity
flow problem and local competitive routing in dynamic networks. In STOC, pages 487–496,
1994.
12. D. J. Baker, J. E. Wieselthier, and A. Ephremides. A distributed algorithm for scheduling the
activation of links in self-organizing mobile radio networks. In IEEE Int. Conference Commu-
nications, pages 2F6.1–2F6.5, 1982.
13. H. Balakrishnan, C. Barrett, A. Kumar, M. Marathe, and S. Thite. The Distance 2-Matching
Problem and Its Relationship to the MAC Layer Capacity of Adhoc Wireless Networks. special
issue of IEEE J. Selected Areas in Communications, 22(6):1069–1079, 2004.
14. N. Bansal and Z. Liu. Capacity, Delay and Mobility in Wireless Ad-Hoc Networks. In IEEE
INFOCOM 2003, San Francisco, CA, April 1–3 2003.
15. N. Bansal and Z. Liu. Capacity, Delay and Mobility in Wireless Ad-Hoc Networks. In IEEE
INFOCOM 2003, San Francisco, CA, April 1–3 2003.
16. C. L. Barrett, A. Marathe, M. V. Marathe, and M. Drozda. Characterizing the interaction be-
tween routing and mac protocols in ad-hoc networks. In MobiHoc, pages 92–103, 2002.
17. A. Borodin, J. M. Kleinberg, P. Raghavan, M. Sudan, and D. P. Williamson. Adversarial queu-
ing theory. J. ACM, 48(1):13–38, 2001.
18. C. Buraagohain, S. Suri, C. Tóth, and Y. Zhou. Improved throughput bounds for interference-
aware routing in wireless networks. In Proc. Computing and Combinatorics Conference, 2007.
19. D. Chafekar, V. A. Kumar, M. Marathe, S. Parthasarathy, and A. Srinivasan. Approximation algorithms for computing capacity of wireless networks with SINR constraints. In Proc. of IEEE INFOCOM, 2008.
20. D. Chafekar, D. Levin, V. A. Kumar, M. Marathe, S. Parthasarathy, and A. Srinivasan. Capacity
of asynchronous random-access scheduling in wireless networks. In Proc. of IEEE INFOCOM,
2008.
21. P. Chaporkar, K. Kar, and S. Sarkar. Throughput guarantees through maximal scheduling in
multi-hop wireless networks. In Proceedings of 43rd Annual Allerton Conference on Commu-
nications, Control, and Computing, 2005.
22. L. Chen, S. H. Low, M. Chiang, and J. C. Doyle. Cross-layer congestion control, routing and
scheduling design in ad hoc wireless networks. In INFOCOM, pages 1–13, 2006.
23. L. Chen, S. H. Low, and J. C. Doyle. Joint congestion control and media access control design
for ad hoc wireless networks. In INFOCOM, pages 2212–2222, 2005.
24. H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of
observations. Annals of Mathematical Statistics, 23:493–509, 1952.
25. M. Chiang, S. H. Low, A. R. Calderbank, and J. C. Doyle. Layering as optimization decom-
position: A mathematical theory of network architectures. Proceedings of the IEEE, 95(1):
255–312, 2007.
26. S. Cui and A. J. Goldsmith. Cross-layer design of energy-constrained networks using cooper-
ative mimo techniques. Signal Processing, 86(8):1804–1814, 2006.
27. S. Cui, R. Madan, A. J. Goldsmith, and S. Lall. Cross-layer energy and delay optimiza-
tion in small-scale sensor networks. IEEE Transactions on Wireless Communications, 6(10):
3688–3699, 2007.
28. D. S. J. De Couto, D. Aguayo, J. Bicket, and R. Morris. A High-Throughput Path Metric for
Multi-Hop Wireless Routing. In Proceedings of Mobicom, pages 134–146. ACM Press, 2003.
29. R. Draves, J. Padhye, and B. Zill. Routing in multi-radio, multi-hop wireless mesh networks. In
MobiCom ’04: Proceedings of the 10th annual international conference on Mobile computing
and networking, pages 114–128, 2004.
30. A. Eryilmaz and R. Srikant. Fair resource allocation in wireless networks using queue-length-
based scheduling and congestion control. In Proc. of IEEE INFOCOM, 2005.
31. A. E. Gamal, J. P. Mammen, B. Prabhakar, and D. Shah. Throughput-delay trade-off in wireless
networks. In IEEE INFOCOM, 2004.
32. M. Gastpar and M. Vetterli. On The Capacity of Wireless Networks: The Relay Case. In IEEE
INFOCOM 2002, New York, NY, June 23–27 2002.
33. L. Georgiadis, M. J. Neely, and L. Tassiulas. Resource allocation and cross-layer control in
wireless networks. Found. Trends Netw., 1(1):1–144, 2006.
34. M. Gerla, B. Zhou, Y.-Z. Lee, F. Soldo, U. Lee, and G. Marfia. Vehicular grid communica-
tions: the role of the internet infrastructure. In WICON ’06: Proceedings of the 2nd annual
international workshop on Wireless internet, page 19, New York, NY, USA, 2006. ACM.
35. M. Grossglauser and D. N. C. Tse. Mobility increases the capacity of ad hoc wireless networks.
IEEE/ACM Trans. Netw., 10(4):477–486, 2002.
36. P. Gupta and P. R. Kumar. The Capacity of Wireless Networks. IEEE Transactions on Infor-
mation Theory, 46(2):388–404, 2000.
37. B. Hajek and G. Sasaki. Link scheduling in polynomial time. IEEE Transactions on Informa-
tion Theory, 34:910–917, 1988.
38. W. Hoeffding. Probability inequalities for sums of bounded random variables. American Sta-
tistical Association Journal, 58:13–30, 1963.
39. K. Jain, J. Padhye, V. N. Padmanabhan, and L. Qiu. Impact of interference on multi-hop wire-
less network performance. In Proceedings of the 9th annual international conference on Mobile
computing and networking, pages 66–80. ACM Press, 2003.
40. M. Kaufmann, J. F. Sibeyn, and T. Suel. Derandomizing algorithms for routing and sorting
on meshes. In SODA ’94: Proceedings of the fifth annual ACM-SIAM symposium on Discrete
algorithms, pages 669–679, Philadelphia, PA, USA, 1994. Society for Industrial and Applied
Mathematics.
41. S. A. Khayam, S. Karande, M. Krappel, and H. Radha. Cross-layer protocol design for real-
time multimedia applications over 802.11b networks. In ICME ’03: Proceedings of the 2003
International Conference on Multimedia and Expo, pages 425–428, 2003.
42. M. Kodialam and T. Nandagopal. Characterizing achievable rates in multi-hop wireless net-
works: the joint routing and scheduling problem. In Proceedings of the 9th annual international
conference on Mobile computing and networking, pages 42–54. ACM Press, 2003.
43. M. Kodialam and T. Nandagopal. Characterizing achievable rates in multi-hop wireless mesh networks with orthogonal channels. IEEE/ACM Trans. Netw., 13(4):868–880, 2005.
44. M. Kodialam and T. Nandagopal. Characterizing the capacity region in multi-radio multi-
channel wireless mesh networks. In MobiCom ’05: Proceedings of the 11th annual interna-
tional conference on Mobile computing and networking, pages 73–87, New York, NY, USA,
2005. ACM Press.
45. U. Kozat and L. Tassiulas. Throughput capacity in random ad-hoc networks with infrastruc-
ture support. In Proc. 9th Annual ACM International Conference on Mobile computing and
networking, 2003.
46. U. C. Kozat and L. Tassiulas. Throughput capacity of random ad hoc networks with infras-
tructure support. In MobiCom ’03: Proceedings of the 9th annual international conference on
Mobile computing and networking, pages 55–65, New York, NY, USA, 2003. ACM Press.
47. S. R. Kulkarni and P. Viswanath. A deterministic approach to throughput scaling in wireless
networks. IEEE Transactions on Information Theory, 50(6):1041–1049, 2004.
48. V. S. A. Kumar, M. V. Marathe, S. Parthasarathy, and A. Srinivasan. End-to-end packet-
scheduling in wireless ad-hoc networks. In SODA ’04: Proceedings of the fifteenth annual
ACM-SIAM symposium on Discrete algorithms, pages 1021–1030, Philadelphia, PA, USA,
2004. Society for Industrial and Applied Mathematics.
49. V. S. A. Kumar, M. V. Marathe, S. Parthasarathy, and A. Srinivasan. Algorithmic aspects of
capacity in wireless networks. In SIGMETRICS ’05: Proceedings of the 2005 ACM SIGMET-
RICS International Conference on Measurement and Modeling of Computer Systems, pages
133–144, 2005.
50. V. S. A. Kumar, M. V. Marathe, S. Parthasarathy, and A. Srinivasan. Provable algorithms for
joint optimization of transport, routing and mac layers in wireless ad hoc networks. In Proc.
DialM-POMC Workshop on Foundations of Mobile Computing, 2007. Eight pages.
51. V. S. A. Kumar, M. V. Marathe, S. Parthasarathy, and A. Srinivasan. Throughput maximization
in multi-channel multi-radio wireless networks. Technical report, Virginia Tech, 2009.
52. M. Kunde. Block gossiping on grids and tori: Deterministic sorting and routing match the
bisection bound. In ESA ’93: Proceedings of the First Annual European Symposium on Algo-
rithms, pages 272–283, London, UK, 1993. Springer-Verlag.
53. Q. Liang, D. Yuan, Y. Wang, and H.-H. Chen. A cross-layer transmission scheduling scheme
for wireless sensor networks. Comput. Commun., 30(14-15):2987–2994, 2007.
54. X. Lin and S. Rasool. A distributed and provably efficient joint channel-assignment, scheduling
and routing algorithm for multi-channel multi-radio wireless mesh networks. In Proc. of IEEE
INFOCOM, 2007.
55. X. Lin and N. B. Shroff. The impact of imperfect scheduling on cross-layer congestion control
in wireless networks. IEEE/ACM Trans. Netw., 14(2):302–315, 2006.
56. X. Lin, N. B. Shroff, and R. Srikant. A tutorial on cross-layer optimization in wireless net-
works. IEEE Journal on Selected Areas in Communications, 24(8):1452–1463, 2006.
57. B. Liu, Z. Liu, and D. Towsley. On the Capacity of Hybrid Wireless Networks. In IEEE INFO-
COM 2003, San Francisco, CA, April 1–3 2003.
58. S. Meyn and R. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London,
1993.
59. H. Nama, M. Chiang, and N. Mandayam. Utility lifetime tradeoff in self regulating wireless
sensor networks: A cross-layer design approach. In IEEE ICC, June 2006.
60. M. Neely and E. Modiano. Capacity and delay tradeoffs for ad hoc mobile networks. IEEE
Transactions on Information Theory, 51(6):1917–1937, 2005.
61. M. J. Neely. Dynamic power allocation and routing for satellite and wireless networks with
time varying channels. PhD thesis, Massachusetts Institute of Technology, LIDS, 2003.
62. M. J. Neely, E. Modiano, and C.-P. Li. Fairness and optimal stochastic control for heteroge-
neous networks. In INFOCOM, pages 1723–1734, 2005.
63. M. J. Neely, E. Modiano, and C. E. Rohrs. Tradeoffs in delay guarantees and computation
complexity for n × n packet switches. In Conference on Information Sciences and Systems,
2002.
64. M. J. Neely, E. Modiano, and C. E. Rohrs. Dynamic power allocation and routing for time-
varying wireless networks. IEEE Journal on Selected Areas in Communications, 23(1):89–103,
2005.
65. S. Parthasarathy. Resource Allocation in Networked and Distributed Environments. PhD thesis,
Department of Computer Science, University of Maryland at College Park, 2006.
66. C. Peraki and S. D. Servetto. On the maximum stable throughput problem in random networks
with directional antennas. In MobiHoc ’03: Proceedings of the 4th ACM international sympo-
sium on Mobile ad hoc networking & computing, pages 76–87, New York, NY, USA, 2003.
ACM Press.
67. A. Rajeswaran and R. Negi. Capacity of power constrained ad-hoc networks. In INFOCOM,
2004.
68. S. Ramanathan and E. L. Lloyd. Scheduling algorithms for multihop radio networks. IEEE-
ACM Transactions on Networking (ToN), 1:166–177, 1993.
69. A. Schrijver. Theory of Linear and Integer Programming. Wiley, 2001.
70. E. Setton, T. Yoo, X. Zhu, A. Goldsmith, and B. Girod. Cross-layer design of ad hoc networks
for real-time video streaming. Wireless Communications, IEEE, 12(4):59–65, 2005.
71. G. Sharma, R. R. Mazumdar, and N. B. Shroff. On the complexity of scheduling in wireless
networks. In MobiCom ’06: Proceedings of the 12th annual international conference on Mobile
computing and networking, pages 227–238, New York, NY, USA, 2006. ACM Press.
72. A. Stolyar. Maximizing queueing network utility subject to stability: Greedy primal-dual
algorithm. Queueing Systems, 50:401–457, 2005.
73. A. L. Stolyar. Maxweight scheduling in a generalized switch: State space collapse and work-
load minimization in heavy traffic. Annals of Applied Probability, 14(1):1–53, 2004.
74. L. Tassiulas. Dynamic link activation scheduling in multihop radio networks with fixed or
changing topology. PhD thesis, University of Maryland, College Park, 1991.
75. L. Tassiulas. Scheduling and performance limits of networks with constantly changing topol-
ogy. IEEE Transactions on Information Theory, 43(3):1067–1073, 1997.
76. L. Tassiulas and A. Ephremides. Stability properties of constrained queueing systems and
scheduling policies for maximum throughput in multihop radio networks. IEEE Trans. Aut.
Contr., 37:1936–1948, 1992.
77. Y. Tian and E. Ekici. Cross-layer collaborative in-network processing in multihop wireless
sensor networks. IEEE Transactions on Mobile Computing, 6(3):297–310, 2007.
78. S. Toumpis and A. Goldsmith. Large wireless networks under fading, mobility, and delay con-
straints. In Proc. of the IEEE INFOCOM, 2004.
79. W. Wang, X.-Y. Li, O. Frieder, Y. Wang, and W.-Z. Song. Efficient interference-aware TDMA
link scheduling for static wireless networks. In MobiCom ’06: Proceedings of the 12th annual
international conference on Mobile computing and networking, pages 262–273, New York,
NY, USA, 2006. ACM Press.
80. X. Wang and K. Kar. Cross-layer rate optimization for proportional fairness in multihop
wireless networks with random access. IEEE Journal on Selected Areas in Communications,
24(8):1548–1559, 2006.
81. X. Wu and R. Srikant. Regulated maximal matching: A distributed scheduling algorithm for
multi-hop wireless networks with node-exclusive spectrum sharing. In IEEE Conf. on Decision
and Control, 2005.
82. Y. Yang, J. Wang, and R. Kravets. Designing routing metrics for mesh networks. In First IEEE
Workshop on Wireless Mesh Networks (WiMesh), 2005.
83. S. Yi, Y. Pei, and S. Kalyanaraman. On the capacity improvement of ad hoc wireless networks
using directional antennas. In Proceedings of the 4th ACM International Symposium on Mobile
Ad Hoc Networking and Computing (MobiHoc), pages 108–116, 2003.
84. G. Zussman, A. Brzezinski, and E. Modiano. Multihop local pooling for distributed throughput
maximization in wireless networks. In Proc. of IEEE INFOCOM, 2008.
Chapter 6
Resource Allocation Algorithms for the Next
Generation Cellular Networks
6.1 Introduction
The forthcoming fourth generation (4G) cellular systems are expected to provide a
wide variety of new services, from high quality voice and high-definition video to
very high bit rate data wireless channels. With the rapid development of wireless
communication networks, it is expected that fourth-generation mobile systems will
be launched around 2012–2015. 4G mobile systems focus on seamlessly integrating
the existing wireless technologies including GSM, wireless LAN, and Bluetooth.
D. Amzallag ()
British Telecommunications plc,
81 Newgate Street, London EC1A 7AJ,
e-mail: [email protected]
D. Raz
Computer Science Department,
Technion – Israel Institute of Technology,
Haifa 32000, Israel,
e-mail: [email protected]
1. We assume that the reader is familiar with the most well-known notions in cellular networks. An excellent introduction can be found in [36].
Fig. 6.1 A typical cell planning instance with candidate base-station locations BS1–BS4
when a modification to a current network is made, but also (and mainly) when there
are changes in the traffic demand, even within a small local area (e.g., building a
new mall in the neighborhood or opening new highways). Cell planning that is able
to respond to local traffic changes and/or to make use of advanced technological
features at the planning stage is essential for cost-effective design of future systems.
Figure 6.1 depicts a typical example of a cell planning instance. Each square in
the picture represents either a single mobile station (associated with its bandwidth
demand) or a cluster of mobile stations (associated with their aggregated demand).
The dotted circles represent potential places to locate base stations (with their corre-
sponding installation/opening costs), and the corresponding drawn patterns are the
coverage area of the configuration (depending on the height, azimuth, tilt, antenna
type, etc.). The outcome of the minimum-cost cell planning problem (CPP) is a
minimum-cost subset of places for positioning base stations (in the figure BS1–BS4)
that covers all the demand of the mobile stations. Mobile stations in an overlapped
area (e.g., the marked mobile stations in the overlapped area of BS1 and BS2) can
be served (or satisfied) by each of these base stations. Mutual interference occurs between the overlapping base stations (the exact type of interference depends on the deployed technology), as described later in this chapter.
This section studies algorithmic aspects of cell planning problems, incorpo-
rates the anticipated future technologies into the cell planning, and presents new
methods for solving these problems. These techniques are based on novel model-
ing of technology-dependent characterizations and approximation algorithms that provide provably good solutions. Clearly, methods presented in this chapter are also
applicable to current networks and various radio technologies.
As introduced earlier, future systems will be designed to offer very high bit rates
with a high frequency bandwidth. Such high frequencies yield a very strong signal
degradation and suffer from significant diffraction resulting from small obstacles,
hence forcing the reduction of cell size (in order to decrease the amount of degra-
dation and to increase coverage), resulting in a significantly larger number of cells
in comparison to previous generations. Future systems will have cells of different
sizes: picocells (e.g., an in-building small base station with antenna on the ceiling),
microcells (e.g., urban street, up to 1 km long with base stations above rooftops at
25 m height), and macrocells (e.g., non-line-of-sight urban macro-cellular environ-
ment). Each such cell is expected to service users with different mobility patterns,
possibly via different radio technologies. Picocells can serve slow mobility users
with relatively high traffic demands. They can provide high-capacity coverage of hot-spot areas, producing local solutions for these areas. Even though these
cells do not have a big RF impact on other parts of the network, they should be taken
into consideration during the cell planning stage since covering hot-spot areas may
change the traffic distribution. At the same time, microcells and macrocells can be
used to serve users with high mobility patterns (highway users) and to cover larger
areas. Hence, it is important to be able to choose appropriate locations for poten-
tial base stations and to consider different radio technologies, in order to achieve
maximum coverage (with low interference) at a minimum cost.
The increased number of base stations, and the variable bandwidth demand of
mobile stations, will force operators to optimize the way the capacity of a base
station is utilized. Unlike in previous generations, the ability of a base station to
successfully satisfy the service demand of all its mobile stations will be highly lim-
ited and will mostly depend on its infrastructure restrictions, as well as on the service
distribution of its mobile stations. To the best of our knowledge, no cell planning approach known today takes the base station (“bandwidth”) capacity into account.
Base stations and mobile terminals are expected to make substantial use of
adaptive antennas and smart antennas. If the system has the ability to distinguish between different users (by their RF positions or by their channel estimation), adaptive antennas will point a narrow lobe to each user, reducing interference while, at the same time, maintaining high capacity. Smart antenna systems
combine an antenna array with a digital signal-processing capability, enabling base
stations to transmit and receive in an adaptive, spatially sensitive manner. In other
words, such a system can automatically change the directionality of its radiation
patterns in response to its signal environment. This can dramatically increase the
performance characteristics (such as capacity) of a wireless system. Hence, future
methods for cell planning should be able to include a deployment of smart anten-
nas and adaptive antennas in their optimization process. Note that current advanced
tools for cell planning already contain capabilities for electrical modifications of tilt
and azimuth.
The theoretical models for cell planning are closely related to a family of combina-
torial optimization problems called facility location problems. The facility location
problem is one of the most well-studied problems in combinatorial optimization
(see [35] for an excellent survey). In the traditional facility location problem we
wish to find optimal locations for facilities (or base stations) in order to serve a
given set of client locations; we are also given a set of locations in which facilities
may be built, where building a facility in location i ∈ I incurs a cost of c_i; each client j ∈ J must be assigned to one facility, thereby incurring a cost of c_ij, proportional to the distance between locations i and j; the objective is to find a solution of
minimum total (assignment + opening) cost. In the k-median problem, facility costs
are replaced by a constraint that limits the number of facilities to be k and the ob-
jective is to minimize the total assignment costs. These two classical problems are
min-sum problems, in that the sum of the assignment costs goes into the objective
function. The k-center problem is the min–max analogue of the k-median problem:
one builds facilities at k locations out of a given number of locations, so as to min-
imize the maximum distance from a given location to the nearest selected location.
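For concreteness, the traditional facility location problem described above can be written as the following standard integer program (recalled here only as a textbook reminder, not as notation used later in this chapter):

    min   ∑_{i∈I} c_i y_i + ∑_{i∈I} ∑_{j∈J} c_ij x_ij
    s.t.  ∑_{i∈I} x_ij = 1,          ∀ j ∈ J,
          x_ij ≤ y_i,                ∀ i ∈ I, j ∈ J,
          x_ij, y_i ∈ {0, 1},        ∀ i ∈ I, j ∈ J,

where y_i indicates whether a facility is opened at location i and x_ij indicates whether client j is assigned to facility i. The k-median variant drops the opening costs from the objective and instead adds the constraint ∑_{i∈I} y_i ≤ k.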
Theoretically speaking, the minimum-cost cell planning problem (CPP) is a
“new” type of discrete location problem. In a cell planning problem, every client
j ∈ J has a positive demand d_j, and every facility (or base station) i ∈ I also has a hard capacity w_i and a subset S_i ⊆ J of clients admissible to be satisfied by it.
Two interesting situations can happen when satisfying the demand of a client. The
first case is when multiple coverage is possible, meaning that several facilities are
allowed to participate in the satisfaction of a single client, while in the second case
clients can be satisfied only by a single facility. In addition, a penalty function is in-
troduced, capturing the “interference” between radio channels of neighboring base
stations (or facilities). In this model, for example, when the demand of a client is
satisfied by two facilities, their “net” contribution is less than or equal the sum of
their supplies. The minimum-cost cell planning problem is to find a subset I 0
I of
minimum cost that satisfies the demands of all the clients (while taking into account
interference for multiple satisfaction).
It is important to note that this new problem is not a special case of any of the
known min-sum discrete location problems (e.g., there is no connection cost be-
tween base stations and clients) nor a “special NP-hard case” of a minimum-cost
flow problem.
2. Notice that when planning cellular networks, the notion of “clients” sometimes means mobile stations and sometimes it represents the total traffic demand created by a cluster of mobile stations at a given location. In this chapter we support both forms of representation.
different cells (inter-cell). In this section we model interference for the forthcoming
cellular systems and present a new approach of incorporating interference in cell
planning.
Interference is typically modeled, for cell planning purposes, by an interference
matrix which represents the impact of any base station on other base stations, as a
result of simultaneous coverage of the same area (see Appendix 6B in [12]). Next we generalize this behavior to also include the geographic position of this (simultaneous) coverage.
Let P be an m × m × n matrix of interference, where p(i₁, i₂, j) ∈ [0, 1] represents the fraction of i₁'s service which client j loses as a result of interference with i₂ (defining p(i, i, j) = 0 for every i ∈ I, j ∈ J, and p(i, i′, j) = 0 for every j ∉ S_{i′})³. This means that the interference caused as a result of a coverage of a client by more than one base station depends on the geographical position of the related “client” (e.g., in-building coverage produces a different interference than coverage on highways using the same set of base stations). As defined above, Q(i, j) is the contribution of base station i to client j, taking into account the interference from all relevant base stations. We describe here two general models for computing Q(i, j).
Let x_ij be the fraction of the capacity w_i of base station i that is supplied to client j. Recalling that I′ ⊆ I is the set of base stations selected to be opened, the contribution of base station i to client j is defined to be

    Q(i, j) = w_i x_ij · ∏_{i′≠i : i′∈I′} (1 − p(i, i′, j)).                    (6.1)
Notice that, as defined by the above model, it is possible that two distinct base stations, say α and β, interfere with each other “in” a place j (i.e., p(α, β, j) > 0) although j ∉ S_β. In general, each of these base stations “interferes” with base station i servicing j and reduces the contribution w_i x_ij by a factor of p(i, i′, j).
Since (6.1) is a high-order expression we use the following first-order approximation, while assuming that the p’s are relatively small:

    ∏_{i′∈I′} (1 − p(i, i′, j)) = (1 − p(i, i′₁, j))(1 − p(i, i′₂, j)) ⋯ ≈ 1 − ∑_{i′∈I′} p(i, i′, j).      (6.2)
3. For simplicity, we do not consider here interference of higher order. These can be further derived and extended from our model.
Consider, for example, a client j belonging to the coverage areas of two base stations i₁ and i₂, and assume that just one of these base stations, say i₁, is actually participating in j’s satisfaction (i.e., x_{i₁j} > 0 but x_{i₂j} = 0). According to the above model, the mutual interference of i₂ on i₁’s contribution (w_{i₁} x_{i₁j}) should be considered, although i₂ is not involved in the coverage of client j.
In most cellular wireless technologies, this is the usual behavior of interference.
However, in some cases a base station can affect the coverage of a client if and only
if it is participating in its demand satisfaction. The contribution of base station i to
client j in this case is defined by
    Q(i, j) = { w_i x_ij (1 − ∑_{i′≠i, i′∈I_j} p(i, i′)),   if ∑_{i′≠i, i′∈I_j} p(i, i′) < 1;
                0,                                           otherwise,                        (6.4)

where I_j is the set of base stations that participate in the coverage of client j, i.e., I_j = {i ∈ I : x_ij > 0}. Notice that in this model the interference function does not depend on the geographic position of the clients.
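To make the two interference models concrete, the following sketch (not taken from the chapter) evaluates Q(i, j) under the product model (6.1) and under the participation-based model (6.4); the dictionary-based instance encoding (w, x, and p keyed by station and client identifiers) is an assumption of this example.

```python
from math import prod

def q_product_model(i, j, opened, w, x, p):
    """Q(i, j) under the product model (6.1): every other opened base
    station i2 attenuates i's supply to client j by (1 - p(i, i2, j)).
    w, x, p are plain dictionaries (an encoding assumed for this sketch)."""
    attenuation = prod(1.0 - p.get((i, i2, j), 0.0) for i2 in opened if i2 != i)
    return w[i] * x.get((i, j), 0.0) * attenuation

def q_participation_model(i, j, w, x, p):
    """Q(i, j) under model (6.4): only base stations that actually serve j
    (x[i2, j] > 0) interfere, and p no longer depends on the client."""
    serving = {i2 for (i2, j2), frac in x.items() if j2 == j and frac > 0}
    total_p = sum(p.get((i, i2), 0.0) for i2 in serving if i2 != i)
    if total_p >= 1.0:
        return 0.0
    return w[i] * x.get((i, j), 0.0) * (1.0 - total_p)
```

Under the first model the attenuation is applied by every opened station, whether or not it serves j; under the second, only the stations in I_j matter, mirroring the distinction discussed above.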
At first sight we must admit that the answer is not so encouraging. BCPP is closely related to the well-known budgeted maximum coverage problem. Given a budget B and a collection S of subsets of a universe U of elements, where each element in U has a specified weight and each subset has a specified cost, the budgeted maximum coverage problem asks for a subcollection S′ ⊆ S of sets, whose total cost is at most B, such that the total weight of elements covered by S′ is maximized. This problem is the “budgeted” version of the set cover problem, in which one wishes to cover all the elements of U using a minimum number of subsets of S. The budgeted maximum coverage problem is a special case of BCPP in which elements are clients with unit demand, every set i ∈ I corresponds to a base station i containing all clients in its coverage area S_i ⊆ J, and w_i ≥ |S_i| for all base stations in I. In this setting, budgeted maximum coverage is precisely the case (in the sense that a solution to BCPP is optimal if and only if it is optimal for the budgeted maximum coverage) when there is no interference (i.e., P is the zero matrix). For the budgeted maximum coverage problem, there is a (1 − 1/e)-approximation algorithm [1, 25], and this is the best approximation ratio possible unless NP = P [18, 25].
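As a rough illustration of the budgeted maximum coverage problem, here is a plain cost-effectiveness greedy sketch. Note that this bare greedy carries no approximation guarantee by itself; the (1 − 1/e) algorithms of [1, 25] additionally enumerate small subfamilies of sets. The instance encoding is assumed for this example only.

```python
def greedy_budgeted_max_coverage(sets, costs, weights, budget):
    """Plain cost-effectiveness greedy for budgeted maximum coverage.
    sets: dict set_id -> iterable of elements; costs: dict set_id -> cost;
    weights: dict element -> weight.  No guarantee is claimed here; the
    (1 - 1/e) algorithms of [1, 25] add an enumeration phase on top."""
    chosen, covered, spent = [], set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for s, elems in sets.items():
            if s in chosen or spent + costs[s] > budget or costs[s] <= 0:
                continue
            gain = sum(weights[e] for e in set(elems) - covered)
            if gain / costs[s] > best_ratio:
                best, best_ratio = s, gain / costs[s]
        if best is None:
            return chosen, covered
        chosen.append(best)
        covered |= set(sets[best])
        spent += costs[best]
```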
BCPP is also closely related to the budgeted unique coverage version of set cover. In the budgeted unique coverage problem elements in the universe are uniquely covered, i.e., appear in exactly one set of S′. As with the budgeted maximum coverage problem, this problem is a special case of BCPP. In this setting, budgeted unique coverage is the case when the interference is taken to be the highest. For the budgeted unique coverage problem, there is an Ω(1/log n)-approximation algorithm [17] and, up to a constant exponent depending on ε, O(1/log n) is the best possible ratio assuming NP ⊄ BPTIME(2^{n^ε}) for some ε > 0. This means that in the case of very strong interference one might get an arbitrarily bad solution (i.e., far from the optimal solution).
Finally, another problem similar to BCPP is the all-or-nothing demand maximization problem (AoNDM). In this problem, we are given a set of base stations that are already opened (or exist), each with its own capacity and a subset of clients that are admissible to be serviced by it. Given a set of clients, each associated with a demand, we wish to find a way to “share” the base stations’ capacity among the clients in order to maximize the number of satisfied clients. This problem is not a special case of BCPP; the only two differences between these two problems are that there is no interference in AoNDM, and that the identity of the opened base stations is unknown in BCPP (moreover, BCPP is trying to decide which base stations are the best to open). “Surprisingly”, AoNDM was shown [3] not to admit a reasonable approximation algorithm under standard complexity assumptions (to be precise, AoNDM is as hard to approximate as the maximum independent set problem). We will devote Section 6.3 to AoNDM and its important role in solving resource allocation problems in future cellular networks.
As expected from these similarities, one can prove that it is NP-hard even to find
a feasible solution to BCPP [4, 6].
To this end we use the fact that typically the number of base stations in cellular
networks is much smaller than the number of clients. Moreover, when there is a
relatively large cluster of antennas in a given location, this cluster is usually designed
to meet the traffic requirements of a high-density area of clients. Thus, for both
interpretations of “clients”, the number of satisfied clients is always much bigger
than the number of base stations. Following the above discussion, we define the k4k-budgeted cell planning problem (k4k-BCPP) to be BCPP with the additional property that every set of k base stations can fully satisfy at least k clients, for every integer k (we refer to this property as the “k4k property”). Using this property it was shown [4, 6] that k4k-BCPP is NP-hard but no longer NP-hard to approximate (i.e., approximating the problem within some performance guarantee becomes polynomial-time solvable).
In the remainder of this section we assume that the interference model is the one
defined in Equation (6.4).
6.2.8 An (e − 1)/(3e − 1)-Approximation Algorithm
Lemma 6.1. Every solution to the budgeted cell planning problem (or to k4k-
BCPP ) can be transformed to a solution in which the number of clients that are
satisfied by more than one base station is at most the number of opened base
stations. Moreover, this transformation leaves the number of fully satisfied clients
as well as the solution cost unchanged.
Fig. 6.2 The cycle canceling algorithm: edge weights (a) before and (b) after canceling a cycle through base station i′
base station vertex as the root (in each of the connected components of G ) and
trimming all client leaves. These leaves correspond to clients who are covered, in
the solution, by a single base station. Since the distance, from the root, to every
leaf of each tree is even, the number of internal client vertices is at most the number of base station vertices; hence |J′′| < |I′|.
2. Otherwise, we transform G = (I′ ∪ J′, E) into an acyclic bipartite graph G′ = (I′ ∪ J′, E′) using the following cycle canceling algorithm.
Cycle canceling algorithm. As long as there are cycles in G, identify a cycle C and let δ be the weight of a minimum-weight edge on this cycle (δ = 2 in Figure 6.2 (right)). Take a minimum-weight edge on C and, starting from this edge, alternately, in clockwise order along the cycle, decrease and increase the weight of every edge by δ.
Two important invariants are maintained throughout the cycle-canceling procedure. The first is that w′(i, j) ≥ 0 for every edge (i, j) of the cycle. The second is that there exists at least one edge e = (i, j) on the cycle for which w′(i, j) = 0. Therefore the number and the identity of the satisfied clients are preserved and G′ is also a solution to the problem. Since at each iteration at least one edge is removed, G′ is acyclic and |J′′| < |I′| as before.
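A minimal sketch of the cycle canceling step is given below, assuming the fractional solution is held in a networkx graph whose edge attribute weight stores how much of client j's demand base station i supplies; this is illustrative only and not the chapter's implementation.

```python
import networkx as nx

def cancel_cycles(G):
    """Cycle canceling on a bipartite 'service' graph: nodes are base stations
    and clients, and edge attribute 'weight' is the amount of the client's
    demand served by the station.  Weight is redistributed along cycles until
    the graph is a forest, keeping every node's total incident weight fixed."""
    G = G.copy()
    while True:
        try:
            cycle = nx.find_cycle(G)          # list of edges (u, v) in order
        except nx.NetworkXNoCycle:
            return G
        weights = [G[u][v]['weight'] for u, v in cycle]
        k = weights.index(min(weights))       # start at a minimum-weight edge
        delta = weights[k]
        for step in range(len(cycle)):
            u, v = cycle[(k + step) % len(cycle)]
            G[u][v]['weight'] += -delta if step % 2 == 0 else delta
        # drop edges whose weight reached zero (at least the starting edge)
        G.remove_edges_from([(u, v) for u, v in G.edges
                             if G[u][v]['weight'] <= 0])
```

Because cycles in a bipartite graph have even length, the alternating updates leave every node's total incident weight unchanged, which is exactly the invariant used in the argument above.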
The main difference between k4k-BCPP and other well-studied optimization problems is the existence of interference. Overcoming this difficulty is done using Lemma 6.1. Although k4k-BCPP is still NP-hard, we show how to approximate it using the greedy approach, similar to the ideas presented in [25].
Prior to using the greedy approach to solve k4k-BCPP it turns out that one must answer the following question: how many clients can be covered by a set S of opened base stations, and how many more can be covered if an additional base station i is to be opened next? Formally, for a given set of base stations I′, let N(I′) be the number of clients that can be satisfied, each by exactly one base station (hence we assume that there is no interference here, or interference of the second kind). We refer to the problem of computing N(·) as the Client Assignment Problem (CAP).
Algorithmically speaking, at first sight CAP has two important properties [4]. The first (which is straightforward, and can be obtained from a reduction from the PARTITION problem) is that CAP is NP-hard. The second, and the non-intuitive one, is that the function N(·) is not submodular (a set function f is submodular if f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T) for all S, T ⊆ U).
To see this, consider the following example: J = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, I = {1, 2, 3}, S₁ = J, S₂ = {1, 2, 3}, S₃ = {4, 5, 6}. The demands are d₁ = d₂ = d₃ = d₄ = d₅ = d₆ = 4, d₇ = 3, d₈ = d₉ = d₁₀ = 9, and the capacities are w₁ = 30, w₂ = w₃ = 12. Let S = {1, 2} and T = {1, 3}. One can verify that N(S) = N(T) = 8, N(S ∩ T) = 7, and N(S ∪ T) = 10.
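The numbers in this counterexample can be checked mechanically. The brute-force sketch below (illustrative only) enumerates every cover-by-one assignment for this ten-client instance and reports N({1,2}), N({1,3}), N({1}), and N({1,2,3}), confirming that N(S) + N(T) = 16 < 17 = N(S ∪ T) + N(S ∩ T).

```python
from itertools import product

demands = {1: 4, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 3, 8: 9, 9: 9, 10: 9}
capacity = {1: 30, 2: 12, 3: 12}
coverage = {1: set(demands), 2: {1, 2, 3}, 3: {4, 5, 6}}

def N(opened):
    """Maximum number of clients fully satisfied, each by exactly one opened
    base station, found by brute force (at most 4**10 assignments here)."""
    opened = list(opened)
    best = 0
    for assignment in product([None] + opened, repeat=len(demands)):
        load = {i: 0 for i in opened}
        feasible, served = True, 0
        for client, station in zip(demands, assignment):
            if station is None:
                continue
            if client not in coverage[station]:
                feasible = False
                break
            load[station] += demands[client]
            served += 1
        if feasible and all(load[i] <= capacity[i] for i in opened):
            best = max(best, served)
    return best

print(N({1, 2}), N({1, 3}), N({1}), N({1, 2, 3}))  # expected: 8 8 7 10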
These two properties indicate the difficulty in applying a greedy approach for
solving k4k-BCPP . Informally speaking, submodularity guarantees that a greedy
choice made at some point stays a greedy choice even when taking into account sub-
sequent steps. Without submodularity it is not clear whether greediness is the right
approach. Moreover, the NP-hardness of CAP implies that we cannot efficiently
compute the best (in the greedy sense) base station to open in a single step. These
two difficulties prevent us from using the generalization of [25] proposed by Sviridenko [32] to approximate k4k-BCPP, as the algorithm of Sviridenko can be used to approximate only submodular functions that are polynomial-time computable.
In order to overcome these problems, we present Algorithm 2 as an approximation algorithm for CAP. The algorithm gets as input an ordered set of base stations I′ = {i₁, ..., i_k}, and is a 1/2-approximation for CAP, as proven in [4].
Let N(OPT) denote the value of the optimal solution for the BMAP instance. It holds that N(OPT) ≥ n₁. For the solution I₁ we know that

    ñ ≥ N_A(I₁) ≥ ((e − 1)/2e) · N(OPT) ≥ ((e − 1)/2e) · n₁.                    (6.5)

We get:

    ((3e − 1)/2e) · ñ = ñ + ((e − 1)/2e) · ñ                                     (6.6)
                      ≥ ñ + ((e − 1)/2e) · |I*|                                  (6.7)
                      ≥ ((e − 1)/2e) · n₁ + ((e − 1)/2e) · n₂                    (6.8)
                      = ((e − 1)/2e) · n*,                                       (6.9)

where inequality (6.7) follows from the fact that ñ ≥ |I₂| ≥ |I*| and the k4k property, and inequality (6.8) is based on (6.5) and Lemma 6.1.
Recall that the minimum-cost cell planning problem (CPP) asks for a subset of base stations I′ ⊆ I of minimum cost that satisfies the demands of all the clients using the available base station capacity.
Let z_i denote the indicator variable of an opened base station, i.e., z_i = 1 if base station i ∈ I is selected for opening, and z_i = 0 otherwise. Consider the following integer program for this problem (IP₁):

    min   ∑_{i∈I} c_i z_i                                        (IP₁)
    s.t.  ∑_{i : j∈S_i} Q(i, j) ≥ d_j,      ∀ j ∈ J,             (6.10)
          ∑_{j∈J} x_ij ≤ z_i,               ∀ i ∈ I,             (6.11)
          0 ≤ x_ij ≤ 1,                     ∀ i ∈ I, j ∈ S_i,    (6.12)
          x_ij = 0,                         ∀ i ∈ I, j ∉ S_i,
          z_i ∈ {0, 1},                     ∀ i ∈ I.             (6.13)
The first set of constraints (6.10) ensures that the demand d_j of every client j is satisfied, while the second set (6.11) ensures that the ability of every open base station to satisfy the demands of the clients is limited by its capacity (and that clients can be satisfied only by opened base stations). The contribution Q(i, j) of base station i to client j, taking into account interference from other base stations, can be modeled as in (6.3) or (6.4), or by any other predefined behavior of interference. However, because of the way the Q(i, j)'s are computed, the integer program (IP₁) is not linear when interference exists. Without loss of generality we may assume that every client in the input has demand at least 1, as the units used can be scaled accordingly and there is no need for “covering” clients with zero demand. Lastly, we use the integrality assumption that the values {w_i}_{i∈I} and {d_j}_{j∈J} are integers.
When there is no interference, IP₁ becomes much simpler. (LP₂) is its linear programming relaxation, in which the last set of integrality constraints (6.13) is relaxed to allow the variables z_i to take rational values between 0 and 1:

    min   ∑_{i∈I} c_i z_i                                        (LP₂)
    s.t.  ∑_{i∈I} w_i x_ij ≥ d_j,           ∀ j ∈ J,             (6.14)
          ∑_{j∈J} x_ij ≤ z_i,               ∀ i ∈ I,             (6.15)
          0 ≤ x_ij ≤ 1,                     ∀ i ∈ I, j ∈ S_i,    (6.16)
          x_ij = 0,                         ∀ i ∈ I, j ∉ S_i,
          0 ≤ z_i ≤ 1,                      ∀ i ∈ I.             (6.17)
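In the no-interference case, (LP₂) can also be handed directly to an off-the-shelf LP solver. The sketch below builds the constraint matrix for scipy.optimize.linprog; the instance encoding (dictionaries for costs, capacities, demands, and the admissible sets S_i) is an assumption of this example, not notation from the chapter. (The chapter's own route, described next, is via a minimum-cost flow formulation.)

```python
import numpy as np
from scipy.optimize import linprog

def solve_lp2(costs, caps, demands, admissible):
    """Solve the LP relaxation (LP2) of the no-interference cell planning IP.
    costs[i], caps[i]: cost and capacity of base station i;
    demands[j]: demand of client j; admissible[i]: the set S_i."""
    stations, clients = list(costs), list(demands)
    # Variable layout: one x_ij per admissible (i, j) pair, then one z_i.
    pairs = [(i, j) for i in stations for j in admissible[i]]
    nx_, nz = len(pairs), len(stations)
    c = np.concatenate([np.zeros(nx_), np.array([costs[i] for i in stations])])
    A_ub, b_ub = [], []
    for j in clients:                      # coverage: sum_i w_i x_ij >= d_j
        row = np.zeros(nx_ + nz)
        for k, (i, jj) in enumerate(pairs):
            if jj == j:
                row[k] = -caps[i]
        A_ub.append(row); b_ub.append(-demands[j])
    for s, i in enumerate(stations):       # capacity: sum_j x_ij <= z_i
        row = np.zeros(nx_ + nz)
        for k, (ii, _) in enumerate(pairs):
            if ii == i:
                row[k] = 1.0
        row[nx_ + s] = -1.0
        A_ub.append(row); b_ub.append(0.0)
    return linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                   bounds=[(0, 1)] * (nx_ + nz))

# Small illustrative instance (hypothetical data).
sol = solve_lp2(costs={'a': 5, 'b': 3}, caps={'a': 10, 'b': 6},
                demands={1: 4, 2: 4, 3: 2},
                admissible={'a': {1, 2, 3}, 'b': {2, 3}})
print(sol.status, sol.fun)
```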
In fact, LP₂ is a minimum-cost flow problem. To see that, consider the network (G, u, c′), which is defined as follows:
– The graph G = (V, E), where V = I ∪ J ∪ {s} and E = {(i, j) | i ∈ I, j ∈ S_i} ∪ {(s, i) | i ∈ I} ∪ {(j, s) | j ∈ J}
– The vertex capacity function u, where u(s) = ∞, u(i) = w_i for i ∈ I, and u(j) = d_j for j ∈ J
– The vertex cost function c′, where c′(i) = c_i/w_i for i ∈ I, c′(j) = 0 for j ∈ J, and c′(s) = −1 − max_{i∈I} c′(i)
Accordingly, the integrality assumption yields that there is an optimal solution to the above flow problem in which the flow on every edge is integral (specifically, any open base station i that serves a client j contributes at least one unit of the client’s demand). Moreover, this solution can be computed efficiently using the known algorithms for minimum-cost flow [2]. We denote the solution to LP₂ which corresponds to that flow by {z̄, x̄}. Let Ī_j = {i ∈ I : x̄_ij > 0}, for every client j ∈ J. Note that by this definition it follows that for every i ∈ Ī_j we have

    w_i x̄_ij ≥ 1.                                                (6.18)
The algorithm generalizes the one of Chuzhoy and Naor [15] for set cover with hard
capacities. In this section we use their notation.
For a subset of base stations H ⊆ I, let f(H) denote the maximum total demand (in demand units, where the clients need not be fully covered) that can be satisfied by the base stations in H. For i ∈ I, define f_H(i) = f(H ∪ {i}) − f(H). Note that when there is no interference, we can calculate f(H) using the following linear program:

    max   ∑_{j∈J} ∑_{i∈H} w_i x_ij                               (LP₃)
    s.t.  ∑_{i∈H} w_i x_ij ≤ d_j,           ∀ j ∈ J,             (6.19)
          ∑_{j∈J} x_ij ≤ 1,                 ∀ i ∈ H,             (6.20)
          0 ≤ x_ij ≤ 1,                     ∀ i ∈ H, j ∈ S_i,    (6.21)
          x_ij = 0,                         ∀ i ∈ H, j ∉ S_i.
One can easily verify that the above process halts with a feasible solution with
the desired property.
Let i₁, i₂, ..., i_k be the base stations that were chosen by Algorithm 4 for the solution, in the order they were chosen. Let I′_ℓ be the solution at the end of iteration ℓ of the algorithm. Let OPT be a set of base stations that comprise an optimal solution {z̄, x̄}.
Next, we inductively define for each iteration ℓ and i ∈ OPT \ I′_ℓ a value a_ℓ(i), so that the following invariant holds: it is possible to cover all the clients using the base stations in OPT ∪ I′_ℓ with the capacities a_ℓ(i) for i ∈ OPT \ I′_ℓ and w_i for i ∈ I′_ℓ.
Let a₀(i) = ∑_{j∈J} w_i x̄_ij. The invariant holds trivially. Consider the ℓth iteration. By the induction hypothesis and Lemma 6.2, there exists a solution {z, x} of IP₁ such that the base stations in I′_ℓ satisfy a total of exactly f(I′_ℓ) demand units and each base station i ∈ OPT \ I′_ℓ satisfies at most a_{ℓ−1}(i) demand units. For each i ∈ OPT \ I′_ℓ let a_ℓ(i) = ∑_{j∈J} w_i x_ij.
1: I′ ← ∅.
2: while f(I′) < ∑_{j∈J} d_j do
3:     Let i = arg min_{i∈I : f_{I′}(i)>0} c_i / f_{I′}(i).
4:     I′ ← I′ ∪ {i}.
5: return I′.
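For illustration, the greedy above can be prototyped once f(·) is available: with no interference, f(H) is exactly the value of a bipartite maximum flow (stations limited by w_i, clients by d_j, edges given by the sets S_i). The networkx-based oracle and the dictionary instance format below are assumptions of this sketch, not the chapter's implementation.

```python
import networkx as nx

def f(H, caps, demands, admissible):
    """f(H): maximum total demand (clients possibly only partially covered)
    servable by the stations in H, computed as a bipartite maximum flow."""
    G = nx.DiGraph()
    G.add_nodes_from(['src', 'snk'])
    for i in H:
        G.add_edge('src', ('bs', i), capacity=caps[i])
        for j in admissible[i]:
            G.add_edge(('bs', i), ('cl', j), capacity=caps[i])
    for j in demands:
        G.add_edge(('cl', j), 'snk', capacity=demands[j])
    value, _ = nx.maximum_flow(G, 'src', 'snk')
    return value

def greedy_cpp(costs, caps, demands, admissible):
    """The greedy loop above: repeatedly open the base station with the best
    cost per extra demand unit covered, until all demand can be satisfied."""
    opened, total = set(), sum(demands.values())
    while f(opened, caps, demands, admissible) < total:
        base = f(opened, caps, demands, admissible)
        gain = {i: f(opened | {i}, caps, demands, admissible) - base
                for i in costs if i not in opened}
        candidates = [i for i, g in gain.items() if g > 0]
        if not candidates:
            raise ValueError('demand cannot be fully covered by I')
        opened.add(min(candidates, key=lambda i: costs[i] / gain[i]))
    return opened
```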
In what follows, we charge the cost of the base stations that are chosen by Algorithm 5 to the base stations in OPT. If i_ℓ ∈ OPT, we do not charge any base station for its cost, since OPT also pays for it. Otherwise, we charge each i ∈ OPT \ I′_ℓ with (c_{i_ℓ} / f_{I′_{ℓ−1}}(i_ℓ)) · (a_{ℓ−1}(i) − a_ℓ(i)). Notice that the total cost of i_ℓ is indeed charged.
Consider a base station i ∈ OPT. If i ∈ I′, let h denote the iteration in which it was added to the solution. Else, let h = k + 1. For ℓ < h, it follows from the definition of a_{ℓ−1}(i) that f_{I′_{ℓ−1}}(i) ≥ a_{ℓ−1}(i). By the greediness of Algorithm 4 it holds that

    c_{i_ℓ} / f_{I′_{ℓ−1}}(i_ℓ)  ≤  c_i / f_{I′_{ℓ−1}}(i)  ≤  c_i / a_{ℓ−1}(i),

and the total cost charged upon i is

    ∑_{ℓ=1}^{h−1} (c_{i_ℓ} / f_{I′_{ℓ−1}}(i_ℓ)) · (a_{ℓ−1}(i) − a_ℓ(i))
        ≤ c_i · ∑_{ℓ=1}^{h−1} (a_{ℓ−1}(i) − a_ℓ(i)) / a_{ℓ−1}(i)
        ≤ c_i · H(a₀(i))
        = c_i · O(log a₀(i))
        = c_i · O(log w_i),

where H(r) is the rth harmonic number. This completes the analysis.
An additional approximation algorithm for CPP can be found in [4, 8]. This algorithm is based on solving the LP-relaxation (LP₂) and randomly rounding the fractional solution to an integer solution. The integer solution it produces is within a factor of O(√(W log n)) of the optimum, where, as before, W = max_{i∈I} {w_i}.
Practically speaking, two different simulation sets were conducted with scenar-
ios relevant to 4G technologies in [4, 8]. Each of these simulations has the goal of
minimizing the total cost and minimizing the total number of antennas (sectors). In
both simulation sets results indicate that the practical algorithms derived from the
theoretical scheme can generate solutions that are very close to the optimal solu-
tions and much better than the proved worst-case theoretical bounds. Moreover, the
algorithms presented in this section achieve a significantly better lower bound on the
solution cost than that achieved by the commonly used greedy approaches [28, 33].
Fig. 6.3 The cell selection problem. Unlike the previous cell planning situation, we are given a
set of base stations that are already selected for opening. The question now is how to best connect
the mobile stations to these base stations in order to maximize the number of mobile stations that
fully satisfy their bandwidth demands
when a mobile station joins the network (called cell selection), or when a mobile
station is on the move in idle mode (called cell reselection, or cell change, in HSPA).
In most current cellular systems the cell selection process is done by a local procedure initiated by a mobile device according to the best detected SNR. In this process the mobile device measures the SNR to several base stations that are within radio range, maintains a “priority queue” of those that are best detected (those whose pilots comprise the active set), and sends an official service request to subscribe to base stations in their order in that queue. The mobile station is connected to the first base station that positively confirms its request. Reasons for rejecting service requests may be handovers or dropped calls in areas where the capacity of the base station is nearly exhausted.
Consider for example the settings depicted in Figure 6.4. Assume that the best
SNR for Mobile Station 1 (MS1) is detected from microcell A, and thus MS1 is
being served by this cell. When Mobile Station 2 (MS2) arrives, its best SNR is
also from microcell A, which is the only cell able to cover MS2. However, after
serving MS1, microcell A does not have enough capacity to satisfy the demand of
MS2 who is a heavy data client. However, if MS1 could be served by picocell B
then both MS1 and MS2 could be served. Note that MS1 and MS2 could represent a
cluster of clients. The example shows that the best-detected-SNR algorithm can be a factor of max{d̃}/min{d̃} from an optimal cell assignment, where d̃ is the demand of any mobile station in the coverage area. Theoretically speaking, this ratio can be arbitrarily large.

Fig. 6.4 Bad behavior of the best detected SNR algorithm in a highly loaded capacitated network
This simple example illustrates the need for a global, rather than a local, cell
selection solution that tries to maximize the global utilization of the network, and not
just the SNR of a single user. In voice only networks, where base station capacities
are considered to be high, sessions have limited duration, and user demands are
uniform, this may not be a big barrier. That is, the current base station selection
process results, in most cases, in a reasonable utilization of the network. However,
in the forthcoming future cellular networks this may not be the case.
Another interesting aspect is the support for different QoS classes for the mobile
stations (e.g., gold, silver, or bronze). In such a case, the operator would like to have
as many satisfied “gold” customers as possible, even if this means several unsatisfied
“bronze” customers.
In this section we follow [3] and study the potential benefit of a new global
cell selection mechanism, which should be contrasted with the current local mobile
SNR-based decision protocol. In particular, we rigorously study the problem of maximizing the number of mobile stations that can be serviced by a given set of base stations in such a way that each of the serviced mobile stations has its minimal demand fully satisfied. We differentiate between two coverage paradigms: The
first is cover-by-one where a mobile station can receive service from at most one
base station. The second is cover-by-many, where we allow a mobile station to be
simultaneously satisfied by more than one base station. This means that when a mobile station has a relatively high demand (e.g., video-on-demand) in a sparse area (e.g., sea-shore), several base stations from its active set can participate in its demand satisfaction. This option is not available in third-generation networks (and not even in HSPA networks) since these networks have universal frequency reuse and the quality of service a mobile station receives will be severely damaged by the derived co-channel interference. However, OFDMA-based technology systems and
their derivatives are considered to be among the prime candidates for future cellular
The important goal of an efficient solution to AoNDM is beyond our reach since this problem is NP-hard. Moreover, it is not even possible to design a close-to-optimal polynomial-time approximation algorithm for AoNDM, unless NP = ZPP⁴. To be precise, since a solution for the general version of AoNDM can be used to solve the Maximum Independent Set Problem in graphs (Problem GJ20 in [21]), and since the latter cannot be approximated within a factor better than |J|^{1−ε}, unless NP = ZPP, for any ε > 0, this hardness of approximation can be used as a lower bound for AoNDM.
4. The class ZPP is equal to the intersection of the computational complexity classes RP and co-RP.
Motivated by practical scenarios where the network satisfies the condition that d(j) ≤ r, a restricted version of AoNDM, the r-AoNDM problem, was defined in the previous section [3].
Two approximation algorithms for the r-AoNDM problem are presented in [3]. The algorithms are based on the local-ratio technique [10, 11] and on a decomposition of the profit obtainable from every client into two non-negative terms; one part is proportional to the demand of the client, while the other part is the remaining profit. A family of feasible solutions is defined, which we dub “maximal” (see the formal definition later in this section), and it is proven that any such “maximal” solution is an approximate solution when considering a profit function which is proportional to the demand. The approximation algorithms generate such maximal solutions recursively, and an inductive argument is applied in order to prove that the solution generated by the algorithm is also an approximate solution w.r.t. the original profit function. We focus here only on one of these two approximation algorithms, the one that guarantees a solution whose value is within a factor of (1 − r)/(2 − r) of the value of an optimal solution. This algorithm follows the cover-by-one paradigm, and thus every mobile station is covered by at most one base station.

The second algorithm is obtained by a careful refinement of this algorithm and an appropriate change to the notion of maximality. This algorithm uses the cover-by-many paradigm, and as shown in [3] is guaranteed to produce a solution whose value is within a factor of (1 − r) of the value of an optimal solution, while the complexity increases by a polynomial factor.
6.3.3 A Cover-by-one (1 − r)/(2 − r)-Approximation Algorithm
1: if J = ∅ then
2:     Return empty assignment
3: J′ = {j ∈ J | p(j) = 0}
4: if J′ ≠ ∅ then
5:     Return r-AoNDM(I, J \ J′, c, d, p)
6: else
7:     δ = min_{j∈J} {p(j)/d(j)}
8:     For every j ∈ J, set p₁(j) = δ · d(j)
9:     x ← r-AoNDM(I, J, c, d, p − p₁)
10:    Using clients from J′, extend x to an α-cover w.r.t. J
11:    Return x
Theorem 6.2. Algorithm 6 produces an α/(α + 1)-approximate (or (1 − r)/(2 − r)-approximate) solution.
Proof. The proof is by induction on the number of recursive calls. The base case is trivial. For the inductive step, we need to consider two cases. For the cover returned in Step 5, by the induction hypothesis, it is an α/(α + 1)-approximation w.r.t. J \ J′, and since all clients in J′ have zero profit, it is also an α/(α + 1)-approximation w.r.t. J. For the cover returned in Step 11, note that by the induction hypothesis, the solution returned by the recursive call in Step 9 is an α/(α + 1)-approximation w.r.t. the profit function p − p₁. Since every client j ∈ J′ satisfies p(j) − p₁(j) = 0, it follows that any extension of this solution is also an α/(α + 1)-approximation w.r.t. p − p₁. Since the algorithm extends this solution to an α-cover by adding clients from J′, and p₁ is proportional to the demand, by Lemma 6.3 we have that the extended α-cover is an α/(α + 1)-approximation w.r.t. p₁. By the Local-Ratio Lemma (see, e.g., [11]), it follows that this solution is an α/(α + 1)-approximation w.r.t. p, thus completing the proof.
There are several interesting problems that arise from this section. The first is
whether or not one can devise a constant-factor approximation algorithm to the
This chapter describes approximation algorithms for several planning and control
problems in the context of advanced wireless technology. It seems that this area
provides many very interesting optimization problems. Moreover, theoretical com-
puter science based approaches and especially approximation algorithm techniques
have been shown to provide, in many cases, very good practical algorithms. We be-
lieve that these approaches can and should be further used to address the diverse
challenges in the design and planning of future cellular networks.
References
1. A. Ageev and M. Sviridenko. Approximation algorithms for maximum coverage and max cut
with given sizes of parts. In Proceedings of the Conference on Integer Programming and Com-
binatorial Optimization (IPCO), volume 1610 of Lecture Notes in Computer Science, pages
17–30. Springer-Verlag, 1999.
2. R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows (Theory, Algorithms, and Appli-
cations). Prentice Hall, 1993.
3. D. Amzallag, R. Bar-Yehuda, D. Raz, and G. Scalosub. Cell Selection in 4G Cellular Networks.
In Proceedings of the Annual IEEE 27th INFOCOM, pages 700–708, 2008.
4. D. Amzallag, R. Engelberg, J. Naor, and D. Raz. Cell planning of 4G cellular networks.
Technical Report CS-2008-04, Computer Science Department, Technion - Israel Institute of
Technology, 2008.
5. D. Amzallag, M. Livschitz, J. Naor, and D. Raz. Cell planning of 4G cellular networks: Algo-
rithmic techniques, and results. In Proceedings of the 6th IEE International Conference on 3G
& Beyond (3G’2005), pages 501–506, 2005.
6. D. Amzallag, J. Naor, and D. Raz. Coping with interference: From maximum coverage to
planning cellular networks. In Proceedings of the 4th Workshop on Approximation and Online
Algorithms (WAOA), volume 4368 of Lecture Notes in Computer Science. Springer-Verlag,
2006.
7. D. Amzallag, J. Naor, and D. Raz. Algorithmic aspects of radio access network design in
B3G/4G cellular networks. In Proceedings of the Annual IEEE 26th INFOCOM, pages 991–
999, 2007.
30. A. Sang, X. Wang, M. Madihian, and R. D. Gitlin. A Load-aware handoff and cell-site se-
lection scheme in multi-cell packet data systems. In Proceedings of the IEEE 47th Global
Telecommunications Conference (GLOBECOM), volume 6, pages 3931–3936, 2004.
31. A. Sang, X. Wang, M. Madihian, and R. D. Gitlin. Coordinated load balancing, handoff/
cell-site selection, and scheduling in multi-cell packet data systems. In Proceedings of the 10th
Annual International Conference on Mobile Computing and Networking (MOBICOM), pages
302–314, 2004.
32. M. Sviridenko. A note on maximizing a submodular set function subject to knapsack con-
straint. Operations Research Letters, 32:41–43, 2004.
33. K. Tutschku. Demand-based radio network planning of cellular mobile communication sys-
tems. In Proceedings of the IEEE 17th INFOCOM, pages 1054–1061, 1998.
34. N. Umeda, T. Otsu, and T. Masamura. Overview of the fourth-generation mobile communica-
tion system. NTT DoCoMo Technical Review, 2(9):12–31, 2004. Available at http://www.ntt.co.jp/tr/0409/special.html.
35. J. Vygen. Approximation algorithms for facility location problems. Technical report 05950-
OR, Research Institute for Discrete Mathematics, University of Bonn, 2005. Available at http://www.or.uni-bonn.de/~vygen/fl.pdf.
36. D. Wisely. Cellular mobile–the generation game. BT Technology Journal (BTTJ), 25(2):27–41,
2007.
Chapter 7
Ethernet-Based Services for Next
Generation Networks
Enrique Hernandez-Valencia
Abstract Over the last few years, Ethernet technology and services have emerged
as an indispensable component of the broadband networking and telecommuni-
cations infrastructure, both for network operators and service providers. As an
example, Worldwide Enterprise customer demand for Ethernet services by itself is
expected to hit the $30B US mark by year 2012. Use of Ethernet technology in the
feeder networks that support residential applications, such as “triple play” voice,
data, and video services, is equally on the rise. As the synergies between packet-aware transport and service-oriented equipment continue to be exploited in the path toward transport convergence, Ethernet technology is expected to play a critical part in the evolution toward converged Optical/Packet Transport networks. Here we discuss the main business motivations, services, and technologies driving the specifications of so-called carrier Ethernet and highlight challenges associated with delivering the expectations for low implementation complexity, ease of use, provisioning, and management of networks and network elements embracing this technology.
7.1 Introduction
Over the last few years, Ethernet technology and services have emerged as an
indispensable component of the broadband networking and telecommunications
infrastructure, both for network operators and service providers. As an example,
Worldwide Enterprise customer demand for Ethernet services by itself is expected
to hit the $30B US mark by year 2012 [47]. Use of Ethernet technology in the feeder
networks that support residential applications, such as “triple play” voice, data, and
video services, is equally on the rise. As the synergies between packet-aware transport and service-oriented equipment continue to be exploited in the path toward transport convergence [40], Ethernet technology is expected to play a critical part in the evolution toward converged Optical/Packet Transport networks. Here we discuss
the main business motivations, services, and technologies driving the specifications
E. Hernandez-Valencia ()
Bell Labs, Alcatel-Lucent, 600 Mountain Avenue, Murray Hill, NJ 07974,
e-mail: [email protected]
132 E. Hernandez-Valencia
of so-called carrier Ethernet and highlight challenges associated with delivering the expectations for low implementation complexity, ease of use, provisioning, and management of networks and network elements embracing this technology.
Ethernet, as defined by the IEEE 802.1 Working Group [11–13, 15], has a ver-
tically integrated architecture framework with physical and data link (MAC) layers
modeled after the OSI reference model. This layered architecture framework allows
Ethernet technology to be exploited in a variety of data networking and telecom-
munications roles, as illustrated in Figure 7.1, in support of Carrier Ethernet centric
packet services:
– As a network infrastructure service used as one of the building blocks of a network operator, or service provider, transport infrastructure (similar in scope to WDM, SDH/SONET [18], or OTN [17] transport services)
– As a connectivity service used to provide a high-bandwidth alternative to Private Line, Frame Relay, and ATM connectivity services for Enterprise site interconnection, or to provide a service interface used to enable business or residential access to L3+ data services and associated applications (e.g., Internet Access, IP VPNs, etc.)
In most applications, digital centric (TDM or Packet) approaches tend to be pre-
ferred in scenarios with high subscriber density; broad interconnect granularity (in
terms of bandwidth demand) and fluid service demand. Conversely, photonic centric
approaches tend to be preferred in scenarios with coarser interconnect granular-
ity and more predictable service demand. Here we present the evolving framework
for Carrier Ethernet services and discuss transport technologies required to support
what we refer to as Ethernet Transport capabilities.
The concept of Carrier Ethernet differs from enterprise Ethernet in both the service
models and their associated reference architectures. Enterprise Ethernet technol-
ogy is optimized for deployment on a private network scenario, with a highly
uniform user base, a high degree of trust on user behavior, and a strong appreci-
ation for plug-and-play functionality. Carrier Ethernet technology on the other hand
is intended to support a public network service, with strong requirements on for-
mal demarcation between subscriber and provider roles, high user heterogeneity,
weak subscriber-to-subscriber and subscriber-to-provider trust, and a pay-to-play
service model. In practice, there is limited formal work on service model specifications. From a network infrastructure perspective most of the ITU-T work is based on the concept of functional modeling [45], which attempts to characterize network technologies according to their forwarding paradigm (circuit vs. packet and connection-oriented vs. connectionless/datagram) and the intrinsic information elements associated with them. More recently, work has focused on the concept of context-aware services, particularly ambient networks [1].
The reference network model for Carrier Ethernet Services follows a traditional
architecture model for telecom public networking services [34, 36]. Figure 7.2
depicts such a reference model as defined by the MEF and ITU-T. This model
follows an emerging industry practice to partition networks along administrative
boundaries between subscribers, network operators, and service providers. A Ser-
vice Provider (SP) network, referred to as a Carrier Ethernet Network (CEN), is
demarked by a number of well-defined External Interfaces. Key among these inter-
faces is the User-to-Network Interface (UNI), which is used to demark the boundary
between the CEN and its service subscribers. Another important external interface,
the External Network-to-Network Interface (E-NNI), is used to demark the bound-
aries between CENs.
Subscribers attach their Customer Equipment (CE) to the CEN’s provider edge
(PE) equipment via such UNIs. The UNI is purposely specified to define the
expected forwarding behavior for the service frames, both on ingress (toward the
CEN) and egress (toward the CE) in terms of capabilities typically associated with
enterprise Ethernet technology. The format for the service frame is expected to com-
ply with the IEEE-defined Ethernet MAC frame format and allows for the use of
virtual LAN tags, also referred to as customer tags (C-Tags), as a means to clas-
sify various kinds of customer frames and flows. Ethernet services are then defined
as a set of processing and forwarding rules for the service frames across a number
of UNIs.
Ethernet services may vary in many ways. For instance, there are the so-called
“retail” services which are typically services intended to be sold to individual sub-
scribers. These services are based on “open” UNI service models and, as such,
tend to be defined from a subscriber perspective. There are also the so-called
“wholesale” services which are typically sold between service providers or service
operators themselves. These services may be based on “open” or private E-NNI im-
plementation agreements, and hence, tend to be described from a network operators
perspective.
A variety of transport technologies, and associated management and control plane protocols, may be used to implement Ethernet services on a given CEN. For instance, SONET/SDH and DWDM/OTN technology use GFP [27] to support point-to-point transport and services for Ethernet frames. Packet-oriented technologies such as Provider Bridging [13], Provider Backbone Bridging [14], and MPLS [43, 44]/PWE3 [33] may also be used to natively support, or emulate, multipoint LAN-like services.
of the important characteristics that distinguish them from other packet service
offerings.
Table 7.1 gives the names of the connectivity services so far specified by the MEF
and the ITU-T and the relationship between them.
Next we provide a short description of the main characteristics of these services
as defined by the MEF and ITU-T.
for transparent Ethernet LAN interconnects as well as for circuit emulation appli-
cations over Ethernet. An E-Line service is referred to as an Ethernet Virtual Private Line (EVPL) service when multiple service instances (i.e., service multiplexing) can be associated with at least one of the UNIs participating in the given point-to-point service instance. This service is intended for applications that can communicate over a shared point-to-point communications channel. Hence, EVPL can provide connectivity capabilities similar to those of point-to-point Frame Relay and ATM data services.
As in the example illustrated in Figure 7.3, an E-Line Service instance, EVC 1,
can be used to provide point-to-point connectivity to transfer Ethernet frames be-
tween symmetric ports with very strict performance commitments, e.g., a premium
circuit-like service between the 10 Mbps UNIs at CE 1 and CE 4. Another set of
E-Line Service instances may be used to provide point-to-point statistical access
to network resources. For instance, EVC 2 interconnects symmetric UNIs at 100
Mbps between CE 2 and CE 4 and provides committed network resources (includ-
ing bandwidth or frame delay commitments) as long as the traffic complies with
a contracted traffic descriptor. At the same time, EVC 3 interconnects two asym-
metric UNIs, at 10 Mbps on CE 3 and at 100 Mbps on CE 4, but may deliver no
performance assurances. In this case, the UNI port on PE4 would be configured as
a service multiplexing interface while the UNI ports on PE1 and PE3 are dedicated
to the given service instance.
Note that since E-Line only provides point-to-point connectivity it does not require support of Ethernet MAC bridging capability and associated Ethernet MAC learning for frame forwarding purposes. Hence, service scaling in terms of switching and storage complexity is on par with Frame Relay and ATM switching.
The Ethernet LAN Services are intended to provide multipoint connectivity between
two or more UNIs via multipoint EVCs. An E-LAN service is referred to as an Ethernet Private LAN (EP-LAN) service when only one service instance can be associated with the UNIs using the given service instance. This service is intended for appli-
cations where the endpoints need to appear to communicate over a dedicated LAN.
Hence, the network provides what is typically referred to as a bridged LAN service, as endpoints communicate as if connected to an IEEE 802.1 VLAN bridge [11]. An E-LAN service is referred to as an Ethernet Virtual Private LAN (EVP-LAN) service when multiple service instances (i.e., service multiplexing) can be associated with
at least one of the UNIs participating in the given multipoint service instance. This
service is intended for applications that can communicate over a shared multipoint
communications channel such as IP routers communicating over a shared Ethernet
LAN or for the interconnection of IEEE 802.1 VLAN bridges.
As in the example illustrated in Figure 7.4, an E-LAN service instance, EVC 1,
can be used to provide multipoint connectivity to transfer Ethernet frames among
a set of symmetric access ports (i.e., connecting CE 1, CE 5, and CE 6), say, with
either statically or statistically allocated transport resources for the interconnected
ports. Similarly, another set of E-LAN service instances may be used to provide
shared multipoint access to network resources. For instance, CE 2 and CE 3 can
use their UNI port to access either CE 4 or CE 5. EVC 2 may provide commit-
ted network resources (including bandwidth or frame delay commitments), as long
as the traffic complies with a contracted traffic descriptor. EVC 3 may deliver no
performance assurances, e.g., Best Effort quality of service.
Since an E-LAN service provides multipoint connectivity it does require either
direct support of Ethernet MAC bridging capability and associated Ethernet MAC
learning for frame forwarding purposes or emulation of such capabilities by some
other means, e.g., VPLS [31, 32]. Hence, service scaling in terms of switching and
storage complexity is higher than for E-Line but still lower than for equivalent emulation capabilities via Frame Relay and ATM switching, as there is no native multipoint forwarding mechanism in those technologies.
The Ethernet Tree Services (E-TREE) are intended to provide multipoint connectivity between two or more UNIs, referred to as Root UNIs, and point-to-point connectivity between a Root UNI and a number of Leaf UNIs, with no direct connectivity between any two Leaf UNIs. An E-TREE service is referred to as an Ethernet Private Tree (EP-Tree) service when only one service instance can be associated with the UNIs using the given service instance. This service is intended for applications where the endpoints need to appear to communicate over a dedicated LAN but wish to restrict connectivity between certain sites. Hence, the network provides what is typically referred to as a “hub-and-spoke” service. An E-TREE service is referred to as an Ethernet Virtual Private Tree (EVP-Tree) service when multiple service instances (i.e., service multiplexing) can be associated with at least one of the UNIs participating in the given multipoint service instance. This service is intended for applications where communications with certain devices must occur over a given set of sites (the Root UNIs) but communications to other devices may occur over a shared point-to-point or multipoint communications channel.
As in the example illustrated in Figure 7.5, an E-TREE service instance, EVC 1,
can be used to provide rooted multipoint connectivity to transfer Ethernet frames
among a set of symmetric access ports (i.e., connecting CE 1, CE 5, and CE 6),
say, with either statically or statistically allocated transport resources for the inter-
connected ports. But unlike E-LAN, CE 5 and CE 6 can communicate directly only with CE 1, not with each other. Also, similarly to E-LAN, another set
of E-TREE service instances may be used to provide shared rooted multipoint ac-
cess to network resources. For instance, CE 2 and CE 3 can use their UNI port to
access either CE 4 or CE 5 via their (Root) UNIs on PE4. EVC 2 may provide com-
mitted network resources (including bandwidth or frame delay commitments), as
long as the traffic complies with a contracted traffic descriptor. EVC 3 may deliver
no performance assurances, e.g., Best Effort quality of service. But, unlike E-LAN,
CE 2 and CE 3 cannot communicate directly among themselves as their EVCs are
provided over Leaf UNIs.
The Ethernet Service Attributes are the service constructs within the MEF Ethernet Service Definition Framework [37] provided to allow further customization of the Ethernet service categories, hence further characterizing the forwarding treatment to be expected by the subscriber. Currently, Ethernet Service Attributes are arranged
into three groups:
1. UNI SA: apply to all service instances created on a specific UNI; hence, they can be set to a different value only per UNI location. Attributes in this group cover most physical media parameters, such as link speed or port aggregation
capabilities. Table 7.2 provides a short description of the service attributes in the
UNI SA group.
2. EVC Endpoint SA (or UNI per EVC SA in MEF terminology): apply to a specific service instance at a given UNI; hence, they can be set to different values on each end of the EVC. Attributes in this group include most of the
direction-specific parameters such as packet filtering rules and traffic descrip-
tors. Table 7.3 provides a short description of the service attributes in the EVC
Endpoint SA group.
3. EVC SA: apply to the entire connection irrespective of direction. Attributes in
this group include most of the performance affecting parameters, such as con-
nection protection model and performance objectives. Table 7.4 provides a short
description of the service attributes in the EVC SA group.
Below we discuss some of the more critical service attributes for service
differentiation.
The EVC type service attribute indicates the kind of connectivity to be established
among the relevant UNIs. Three EVC types are currently specified:
1. Point-to-Point: The EVC associates exactly two UNIs and there are no restrictions on the bi-directional connectivity between them.
2. Multipoint: The EVC associates two or more UNIs and there are no restrictions on the bi-directional connectivity between them.
3. Rooted-Multipoint: The EVC associates two or more UNIs. One or more UNIs are declared Roots and the rest are declared Leafs. There are no restrictions on the bi-directional connectivity between the Root UNIs. Leaf UNIs can only communicate directly with the Root UNIs.
Notice that, in practice, Ethernet services can be delivered from a combination of the above connection types, e.g., some portions of an E-LAN EVC may be supported via point-to-point connections.
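To make the connectivity restrictions concrete, the following is a minimal, hypothetical sketch (the function name and role labels are ours; only the Root/Leaf notions come from the terminology above) that encodes which pairs of UNIs may exchange frames directly under each of the three EVC types.

```python
# Hypothetical sketch of the EVC-type connectivity rules described above;
# names are illustrative, not MEF-defined APIs.

def may_communicate(evc_type, role_a='root', role_b='root'):
    """Return True if two UNIs on the same EVC may exchange frames directly."""
    if evc_type in ('point-to-point', 'multipoint'):
        # No restrictions on bi-directional connectivity between the UNIs.
        return True
    if evc_type == 'rooted-multipoint':
        # Leaf UNIs may communicate directly only with Root UNIs.
        return role_a == 'root' or role_b == 'root'
    raise ValueError(f'unknown EVC type: {evc_type}')

# On an E-TREE (rooted-multipoint EVC), root-leaf works but leaf-leaf does not.
assert may_communicate('rooted-multipoint', 'root', 'leaf')
assert not may_communicate('rooted-multipoint', 'leaf', 'leaf')
```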
The MEF service framework allows for one or more CoS instances (or sets of service frames to which a CoS commitment may apply) to be associated with a single EVC. A minimum set of traffic classification mechanisms is currently identified for the purpose of CoS instance determination via the CoS ID service attribute:
Physical Port
User Priority/Priority Code Point (as per IEEE P802.1ad [13])
IP/MPLS DiffServ/IP TOS (as per IETF RFC 2475 [3])
Ethernet (Layer 2) Control Protocols
The service provider will then enforce different traffic descriptors for each CoS
instance. Each CoS instance will offer different levels of performance as specified
in the performance parameters per class of service, e.g., delay, jitter, and loss. The
following subsections will explore each of the aforementioned CoS identifiers.
In this case a single CoS instance is associated with all service frames across
the physical port irrespective of the number of EVCs that may be configured on the
UNI. The port-based CoS ID provides the simplest way to specify a CoS instance, using the minimum amount of subscriber information. Port-based CoS indications are well suited for “hose” or “point-to-cloud” service models. Yet they pose the greatest challenge for network resource allocation by the service provider [5].
In this case up to eight CoS instances can be associated with non-overlapping sets of service frames of a given EVC by looking into the Ethernet MAC Priority Code Point (PCP) field as per clause 9 of IEEE 802.1Q [13] (formerly the User Priority field of IEEE 802.1p). This option is only applicable when the subscriber Ethernet frames are tagged with either a Priority tag or a VLAN tag as per IEEE 802.1Q [11]. The
PCP based CoS ID provides a means for customers to indicate QoS commitments,
including frame delay and frame loss precedence, for their service frames. Either rel-
ative (i.e., as in IETF DiffServ) or strict QoS commitments can be offered under
this CoS ID by encoding both the service frame priority and drop precedence in
the PCP field, using the IEEE defined 8P0D, 7P1D, 6P2D, or 5P3D CoS encoding
model. Here, the digit before P indicates the number of distinct frame delay prior-
ity levels and the digit before D the number of frame drop precedence levels being
communicated. PCP-based CoS indications are comparable to traditional connection-oriented QoS models such as those of ATM, Frame Relay, and the IP Integrated Services/Differentiated Services models [2].
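A small sketch can make the nPmD naming convention explicit. The parsing logic below simply follows the description above (the digit before P gives the number of delay priority levels, the digit before D the number of drop precedence levels, eight PCP code points in total); it does not reproduce the per-codepoint tables of IEEE 802.1ad.

```python
# Illustrative sketch of the nPmD CoS encoding model names used above.
import re

def parse_pcp_model(model: str):
    m = re.fullmatch(r'(\d)P(\d)D', model)
    if not m:
        raise ValueError(f'unrecognized CoS encoding model: {model}')
    priorities, drop_levels = int(m.group(1)), int(m.group(2))
    # The 3-bit PCP field offers 8 code points; each drop-precedence indication
    # consumes one code point that would otherwise signal a distinct priority.
    assert priorities + drop_levels == 8
    return priorities, drop_levels

for model in ('8P0D', '7P1D', '6P2D', '5P3D'):
    p, d = parse_pcp_model(model)
    print(f'{model}: {p} delay priority levels, {d} drop precedence levels')
```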
In this case information from either an IPv4/IPv6 TOS field or DiffServ fields can
be used to associate a CoS instance with a non-overlapping set of service frames on a given EVC. IP TOS, in general, can be used to provide up to eight CoS instances to a given EVC. The IP TOS precedence model is similar to the 8P0D QoS precedence
model in IEEE 802.1Q. DiffServ, by contrast, defines several per-hop behaviors
(PHBs) that can be used to provide more granular QoS capabilities when compared
to the simple forwarding precedence based on the IP TOS field. DiffServ can be
used to specify up to 64 QoS encodings (called DiffServ codepoints or DSCPs) that
can be used to define, theoretically, up to 64 CoS instances per EVC (note that CoS
instance identification is not allowed on a per IP address basis).
Standardized DiffServ PHBs include Expedited Forwarding (EF) for a low delay,
low loss service, four classes of Assured Forwarding (AF) for bursty real-time and
non-real-time services, Class Selector (CS) for some backward compatibility with
IP TOS, and Default Forwarding (DF) for best effort services. Unlike the port and PCP CoS ID indicators, the DiffServ and IP TOS based CoS ID indicators require the subscriber CE and provider PE to inspect and classify the service frames as per the IP/MPLS packet header in the Ethernet frame's payload. Hence, the price to be
paid for the additional CoS granularity is increased configuration complexity when
establishing the EVC.
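As an illustration, the classifier below groups DSCP values into CoS instances along the lines of the standardized PHBs just mentioned. The DSCP numbers are the well-known IETF code points; the CoS instance names ('premium', 'assured-n', and so on) are assumptions chosen for this example rather than MEF-defined labels.

```python
# Illustrative sketch: mapping DiffServ code points (DSCPs) to CoS instances.
EF = {46}                                                                  # Expedited Forwarding
AF = {cls: {8 * cls + 2 * d for d in (1, 2, 3)} for cls in (1, 2, 3, 4)}   # AFx1..AFx3
CS = {8 * p for p in range(1, 8)}                                          # Class Selector CS1..CS7
DF = {0}                                                                   # Default Forwarding

def cos_instance(dscp: int) -> str:
    if dscp in EF:
        return 'premium'             # low delay, low loss
    for cls, codepoints in AF.items():
        if dscp in codepoints:
            return f'assured-{cls}'  # bursty real-time and non-real-time classes
    if dscp in CS:
        return 'legacy-tos'          # backward compatibility with IP TOS precedence
    return 'best-effort'             # DF and any unclassified code point

print(cos_instance(46), cos_instance(10), cos_instance(24), cos_instance(0))
# -> premium assured-1 legacy-tos best-effort
```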
In this case a single CoS instance can be associated with all service frames
conveying IEEE 802 based control protocols such as PAUSE messages and
Link Aggregation Control Protocol in IEEE 802.3 or Spanning Tree Protocols
or GARP/GVRP in IEEE 802.1Q. The Layer 2 Control Protocol based CoS ID provides the simplest way to specify QoS differentiation for user control traffic over user data frames. Note that use of PCP values for this purpose would require careful configuration of the frame classification rules in the subscriber networks, as Ethernet control protocols are typically carried untagged in IEEE 802 based networks.
$$B_c(t_j) = \min\left\{\, CBS,\; B_c(t_{j-1}) + \frac{CIR}{8}\times(t_j - t_{j-1}) \right\}$$

$$O(t_j) = \max\left\{\, 0,\; B_c(t_{j-1}) + \frac{CIR}{8}\times(t_j - t_{j-1}) - CBS \right\}$$

$$B_e(t_j) = \min\left\{\, EBS,\; B_e(t_{j-1}) + \frac{EIR}{8}\times(t_j - t_{j-1}) + CF \times O(t_j) \right\}$$
As noted before, the pairs of parameters CIR/CBS and EIR/EBS characterize the
behavior of the two-rate bandwidth profile specified by the MEF. The CIR param-
eter indicates the long-term average rate allocated to the frames that draw their
transmission opportunity from the presence of sufficient credits from the “green”
token bucket. The CBS actually indicates the maximum number of tokens that can
be accumulated in the “green” token bucket, or the maximum frame burst size if
multiple frames could arrive at the same time. Similarly, the EIR parameter indi-
cates the long-term average rate allocated to the frames that draw their transmission
opportunity from the presence of sufficient credits from the “yellow” token bucket.
The EBS indicates the maximum number of tokens that can be accumulated in the
“yellow” token bucket, or the maximum frame burst size if multiple frames could
arrive at the same time.1
The parameters CF and CM determine how tokens are drawn and disposed of. CF is a binary parameter that determines whether unused frame transmission opportunities lost from a full “green” token bucket can be used to replenish the “yellow” token bucket. Lost transmission opportunities from the “yellow” token bucket are never reused.
The CM parameter indicates whether the Ethernet service provider recognizes frame
coloring on the incoming frames, prior to the application of the BWP algorithm. CM
is said to be Color-Blind if the service provider always deems the initial color of any
arriving frame to be “green”. On the other hand, CM is said to be Color-Aware if
the service provider recognizes the incoming color of the frame.
The frame color is used to determine the set of frames that qualifies for QoS com-
mitments. Frames colored “green” qualify for any of the performance objective
specified for the relevant CoS. Frames colored “yellow” qualify for “Best Effort”
treatment (up to the committed EIR). Frames can be colored by either the sub-
scriber or the service provider. Frame coloring by the subscriber is relevant only
if the subscriber shapes the relevant CoS instance to conform to the negotiated CIR
and the service provider recognizes the frame colors generated by the subscriber. Frame coloring by the service provider happens as an outcome of applying the BWP algorithm.
¹ Note that the actual maximum burst size at the UNI, for instance, would be a function of the UNI link rate, the CIR and the CBS (for the “green” token bucket) or the EIR and the EBS (for the “yellow” token bucket).
Frames conforming to their intended token bucket retain their color as they traverse the CEN. Frames that fail to conform to their intended token bucket are re-colored: non-
conforming “green” frames are re-colored “yellow” and non-conforming “yellow”
frames are re-colored “red”. “Red” frames are not required to be transported across
the CEN.
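Putting the pieces together, the sketch below implements the two-rate bandwidth profile as just described: the token-bucket updates follow the three equations given earlier, while the green/yellow/red declaration and token decrement steps follow the usual MEF-style marking procedure and should be read as an assumption of this example rather than a literal transcription of the specification.

```python
# Minimal sketch of the MEF-style two-rate, three-color bandwidth profile.
class BandwidthProfile:
    def __init__(self, cir, cbs, eir, ebs, cf=0, color_aware=False):
        self.cir, self.cbs = cir, cbs        # committed rate (bit/s), burst size (bytes)
        self.eir, self.ebs = eir, ebs        # excess rate (bit/s), burst size (bytes)
        self.cf = cf                         # coupling flag CF (0 or 1)
        self.color_aware = color_aware       # color mode CM
        self.bc, self.be = cbs, ebs          # both buckets start full
        self.last_t = None

    def mark(self, t, length, incoming_color='green'):
        """Return the color assigned to a frame of `length` bytes arriving at time t."""
        dt = 0.0 if self.last_t is None else t - self.last_t
        self.last_t = t
        # Token-bucket updates (the three equations above).
        bc_raw = self.bc + self.cir / 8.0 * dt
        overflow = max(0.0, bc_raw - self.cbs)   # O(t_j): tokens lost by a full green bucket
        self.bc = min(self.cbs, bc_raw)
        self.be = min(self.ebs, self.be + self.eir / 8.0 * dt + self.cf * overflow)
        if not self.color_aware:
            incoming_color = 'green'             # color-blind: initial color deemed green
        if incoming_color == 'green' and length <= self.bc:
            self.bc -= length
            return 'green'                       # qualifies for the CoS performance objectives
        if incoming_color in ('green', 'yellow') and length <= self.be:
            self.be -= length
            return 'yellow'                      # best-effort treatment up to the EIR
        return 'red'                             # need not be transported across the CEN

# Example with assumed parameter values: 10 Mb/s CIR/EIR, 12 kB bursts, CF = 1.
bwp = BandwidthProfile(cir=10e6, cbs=12000, eir=10e6, ebs=12000, cf=1)
print([bwp.mark(t=i * 0.001, length=1500) for i in range(5)])
```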
The performance service attribute conveys the committed quality of service associ-
ated with the identified CoS Instance. There are three parameters currently specified:
Frame delay
Frame jitter
Frame loss
Frame loss is the most elemental performance metric for most packet services as it determines the actual throughput delivered by the service. Frame loss is defined as
the percentage of CIR-conformant (green) service frames not delivered between an
ordered pair of UNIs over the sampling interval. Hence, let TXij denote the total number of green service frames received by UNI i and intended for UNI j, and let RXij denote the number of green service frames received from UNI i by UNI j. Then FLij, the Frame Loss between UNI i and UNI j, can be expressed as

$$FL_{ij} = I(TX_{ij}, 1)\times\frac{TX_{ij} - RX_{ij}}{TX_{ij}}\times 100\,\%,$$

where

$$I(x, y) = \begin{cases} 1 & \text{if } x \ge y \\ 0 & \text{otherwise.} \end{cases}$$
A limitation of defining the frame delay metric in terms of percentiles of the frame delay distribution is the difficulty of allocating delay budgets to network components or network domains in order to achieve a given end-to-end delay objective.
Frame delay variation (FDV), also referred to as frame delay jitter, is a critical
performance parameter for real-time applications such as IP telephony, Circuit
Emulation, or broadband video distribution. These real-time applications require
a low and predictable delay variation to ensure timely play-out of the transferred
information. Unfortunately, there is no universal agreement on a metric for frame delay variation. The dominant concern is applicability to application requirements versus computational complexity. One metric for FDV, used by the MEF, defines FDV in
terms of a high-percentile of the distribution between the sample frame delay and
the minimum frame delay for the target set of UNIs over a target sampling period.
Let $V_T = \{\, d_{ij}^{kl} \mid \forall i, j \text{ such that } a_l - a_k = \Delta t,\; a_k \in T, \text{ and } a_l \in T \,\}$ be the set of all delay variations for all eligible pairs of Green Service Frames, where $a_k$ represents the arrival time of the $k$th Green Service Frame in the set. Let $N$ be the number of samples in $V_T$. Define $\tilde{d}_{ij}^{\,P}$ to be the $P$-percentile of the set $V_T$. Thus

$$\tilde{d}_{ij}^{\,P} = \begin{cases} \min\left\{ d \;\middle|\; \dfrac{P}{100} \le \dfrac{1}{N}\sum I(d, d_{ij}) \right\} & \text{if } N \ge 1 \\ 0 & \text{otherwise,} \end{cases}$$

where

$$I(d, d_{ij}) = \begin{cases} 1 & \text{if } d \ge d_{ij} \\ 0 & \text{otherwise,} \end{cases}$$

and the sum is carried out over all the values in the set $V_T$. An alternative definition of FDV is as the percentile of the difference between the observed delay and some fixed component, such as the minimum or average delay.
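A brief sketch may help fix the FDV definition. Assuming the reconstruction above (delay variations taken over pairs of green frames whose arrivals are separated by a chosen interval, reported as a P-percentile), the function below computes the metric from a list of arrival times and one-way delays; the variable names and the use of the absolute difference are assumptions of this example.

```python
# Illustrative sketch of the percentile-based FDV metric described above.
def fdv_percentile(arrivals, delays, delta_t, p, tol=1e-6):
    """arrivals: sorted arrival times of green frames; delays: their one-way delays."""
    variations = []
    for k, a_k in enumerate(arrivals):
        for l in range(k + 1, len(arrivals)):
            if abs((arrivals[l] - a_k) - delta_t) <= tol:      # eligible frame pair
                variations.append(abs(delays[l] - delays[k]))
    if not variations:                                         # no samples: define FDV as 0
        return 0.0
    variations.sort()
    n = len(variations)
    # Smallest d such that at least p% of the samples do not exceed d.
    for d in variations:
        if sum(1 for v in variations if v <= d) / n >= p / 100.0:
            return d
    return variations[-1]

arrivals = [0.00, 0.02, 0.04, 0.06, 0.08]        # seconds
delays   = [0.010, 0.012, 0.011, 0.015, 0.010]   # seconds
print(fdv_percentile(arrivals, delays, delta_t=0.02, p=95))   # -> 0.005
```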
Due to the multipoint nature and the uncertainty in the actual traffic flow between the multiple connection points, it is extremely difficult to provision the network to meet all traffic demands without potentially wasting substantial network resources. Techniques such as dual-hop load balancing, also named Valiant Load Balancing (VLB) [30, 50], could be used as a forwarding approach robust to traffic fluctuations for E-LAN emulation using full mesh connectivity. Yet the scheme increases transport delays and is unfriendly to certain Ethernet-specific forwarding functions (such as unknown traffic replication). Handling resource allocation as a multicommodity flow problem [10] can be impractical for very large networks [1].
Solutions in this application space deliver “virtual fibre” based connections required
to implement an Ethernet-oriented transport network infrastructure. The functional
blocks used to implement these services are modeled after ITU-T requirements for
network transport equipment [22]. Note that use of IEEE 802.3 physical layer in-
terfaces to interconnect network equipment also falls into this application space.
A “virtual fibre” Ethernet Network Service is intended to emulate a physical
medium or “wire”. It is typically realized as a point-to-point constant-bit rate (CBR)
transport capability that extends the reach of a defined Ethernet PHY (IEEE 802.3).
It may be complemented with networking mechanisms from an underlying optical layer network for switching, OA&M, and trail protection, including PHY layer extensions
for link fault detection, performance management, and client signal fail (when the
emulated PHY lacks such features). Network elements for this application space are
expected to be managed consistent with established architectural and operational
models for transport network operators [20, 21]. Products that address this appli-
cation space fall within what is referred to as the Multi-Service Transport Platform (MSTP)/Multi-Service Provisioning Platform (MSPP) market segment, and they include network elements based on:
Ethernet-over-WDM (EoW)
Ethernet-over-SDH/SONET (EoS) – Type 1 Services
Ethernet-over-OTH (EoP)2
Figure 7.7 illustrates the functional framework for WDM and TDM solutions
addressing the “virtual fibre” application space.
Network elements in this application space deliver “virtual circuit” based connec-
tions required to implement an Ethernet-oriented transport network infrastructure.
² The Optical Transport Hierarchy encompasses integrated photonic and digital transmission and switching capabilities for next-generation transport systems.
The functional blocks used to implement these services are modeled after ITU-T re-
quirements for network transport equipment [19] and the functional model for the
Ethernet MAC [22]. A “virtual circuit” Ethernet Transport Service is intended to emulate a “shared” or “fractional” link. It is typically realized as a point-to-point or point-to-multipoint variable-bit rate (VBR) packet-oriented transport capability that provides differentiable levels of service independent of the access media speed. It may be complemented with networking mechanisms from an underlying optical layer
network, mostly for OA&M and trail protection (when the emulated data link lacks
such features). Network elements for this application space are also expected to
be managed consistent with established architectural and operational models for
transport network operators. Yet, they must also incorporate best-in-class data net-
working procedures that allow for forwarding and control plane independence, and
packet-oriented OAM. Products that address this application space fall within what
is referred to as the Next Generation Multi-Service Provisioning Platforms (NG
MSPP)/Optical Packet Transport System (OPTS) market segments, and they include
network elements based on:
Ethernet-over-Fibre (EoF) a.k.a. Carrier Ethernet
Ethernet-over-SDH/SONET (EoS) – Type 2 Services
Ethernet-over-(T)MPLS (EoM)
Figure 7.8 illustrates the functional framework for the hybrid WDM/TDM/Packet
solutions addressing this application space.
Fig. 7.8 Functional framework for the hybrid WDM/TDM/Packet “virtual circuit” solutions (lambda “colored” optics, 802.3 B&W optics, CBR–ODU, CBR–SDH, VBR–MPLS, VBR–PB, OCh switch, fibre)
Solutions in this application space may be based on either “virtual fibre” or “virtual
circuit” interconnect capabilities in order to be optimized for Enterprise data net-
working applications. Ethernet connectivity services are based on the Metro Ethernet Forum's E-Line and E-LAN service definitions, and they use the IEEE 802.1 MAC as the basis for their service attributes.
Ethernet connectivity services allow for a variety of enterprise data networking
applications. Specifically:
Ethernet Private Lines and Private LANs may be delivered through CBR-
oriented, “virtual fibre” transport solutions addressing mission-critical Enterprise
services. They are managed consistently with established operational models for
private managed networks (a subset of the network operation model in [20, 21]).
Ethernet Virtual Private Lines and Virtual Private LANs may be delivered
through VBR-oriented, “virtual circuit” transport solutions addressing public/
private Enterprise data networking services. They are managed consistently with
operation models for public data networks [22, 25] (either via vendor specific
EMS/NMS or Web/SNMP based 3rd party management tools).
The solution space is addressed by products designed for the Optical Packet
Transport Systems (OPTS) market segment with integrated Packet, TDM, and
WDM fabrics.
References
1. F. Belqasmi, R. Glitho, and R. Dssouli. Ambient network composition. IEEE Network Maga-
zine, Jul/Aug 2008.
2. J. C. R. Bennett, K. Benson, A. Charny, W. F. Courtney, and J. Y. Le-Boudec. Delay jitter
bounds and packet scale rate guarantee for expedited forwarding. IEEE/ACM Transactions on
Networking, 10(4), August 2002.
3. S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. An Architecture for
Differentiated Service. RFC 2475 (Informational), December 1998.
4. N. Brownlee and K. C. Claffy. Understanding internet traffic streams: Dragonflies and tortoises.
IEEE Communications Magazine, 40(10), August 2002.
5. N.G. Duffield, P. Goyal, A. Greenberg, P. Mishra, K.K. Ramakrishnan, and J.E.V.D. Merwe.
Resource management with hoses: Point-to-cloud services for virtual private networks.
IEEE/ACM Transactions on Networking, 10(5):679–692, 2002.
6. K. Elmeleegy, A. L. Cox, and T. S. Ng. Etherfuse: An ethernet watchdog. In ACM SIGCOMM,
2007.
7. K. Elmeleegy, A. L. Cox, and T. S. Ng. Understanding and mitigating the effects of count to in-
finity in ethernet networks. IEEE/ACM Transactions on Networking, 17(9):186–199, February
2009.
8. S. Acharya et al. PESO: Low overhead protection for ethernet over SONET transport. In IEEE
Infocom, 2004.
9. A. Feldmann, A. Greenberg, C. Lund, N. Reingold, J. Rexford, and F. True. Deriving traffic
demands for operational ip networks: Methodology and experience. IEEE/ACM Transactions
on Networking, 9(3), June 2001.
10. A. Gupta, J. Kleinberg, A. Kumar, R. Rastogi, and B. Yener. Provisioning a virtual private
network: A network design problem for multicommodity flow. In ACM Symposium on Theory
of Computing (STOC), 2001.
11. IEEE standard 802.1Q – IEEE standards for local and metropolitan area networks – virtual
bridged local area networks. Institute of Electrical and Electronics Engineers, May 2003.
12. IEEE standard 802.1d – IEEE standard for local and metropolitan area networks: Media access
control (MAC) bridges. Institute of Electrical and Electronics Engineers, June 2004.
13. IEEE standard 802.1ad – IEEE standard for local and metropolitan area networks – virtual
bridged local area networks – amendment 4: Provider bridges. Institute of Electrical and Elec-
tronics Engineers, May 2006.
14. IEEE project 802.1ah – provider backbone bridging. Institute of Electrical and Electronics
Engineers, June 2008. See http://www.ieee802.org/1/pages/802.1ah.html.
15. IEEE standard 802.3 – 2005, information technology – telecommunications and information
exchange between systems – local and metropolitan area net-works – specific requirements –
part 3: Carrier sense multiple access with collision detection (CSMA/CD) access method and
physical layer specifications. Institute of Electrical and Electronics Engineers, December 2005.
16. IETF TRILL working group. Internet Engineering Task Force, http://www.ietf.org/html.
charters/trill-charter.html, 2009.
17. ITU-T recommendation G.707 network node interface for the synchronous digital hierarchy
(SDH). International Telecommunication Union, December 1998.
18. ITU-T recommendation G.709. interfaces for the optical transport network (OTN). Interna-
tional Telecommunication Union, March 2003.
19. ITU-T recommendation G.8011 ethernet over transport – ethernet services framework. Inter-
national Telecommunication Union, August 2005.
20. ITU-T G.798. characteristics of optical transport network hierarchy equipment functional
blocks. International Telecommunication Union, December 2006.
21. ITU-T recommendation G.783. characteristics of synchronous digital hierarchy (SDH) equip-
ment functional blocks. International Telecommunication Union, March 2006.
22. ITU-T recommendation G.8021 characteristics of ethernet transport network equipment func-
tional blocks. International Telecommunication Union, June 2006.
23. ITU-T recommendation G.8110.1, architecture of transport MPLS layer network. International
Telecommunication Union, 2006.
24. ITU-T recommendation G.8113, requirements for operation & maintenance functionality in
T-MPLS networks. International Telecommunication Union, 2006.
25. ITU-T draft recommendation G.8114, operation & maintenance mechanism for T-MPLS layer
networks. International Telecommunication Union, 2007.
26. ITU-T Y.1540. internet protocol data communication service – IP packet transfer and availabil-
ity performance parameters. International Telecommunication Union, 2007.
27. ITU-T Recommendation G.7041. Generic Framing Procedure (GFP), 2008.
28. C. Kim, M. Caesar, and J. Rexford. Floodless in SEATTLE: A scalable ethernet architecture
for large enterprises. In ACM SIGCOMM, 2008.
29. E. Knightly and N.B. Shroff. Admission control for statistical QoS: theory and practice. IEEE
Network Magazine, 13(2), 1999.
30. M. Kodialam, T.V. Lakshman, and S. Sengupta. Efficient and robust routing of highly variable
traffic. In HotNets III, 2004.
31. K. Kompella and Y. Rekhter. Virtual Private LAN Service (VPLS) Using BGP for Auto-
Discovery and Signaling. RFC 4761 (Proposed Standard), January 2007.
32. M. Lasserre and V. Kompella. Virtual Private LAN Service (VPLS) Using Label Distribution
Protocol (LDP) Signaling. RFC 4762 (Proposed Standard), January 2007.
33. L. Martini, E. Rosen, N. El-Aawar, and G. Heron. Encapsulation Methods for Transport of
Ethernet over MPLS Networks. RFC 4448 (Proposed Standard), April 2006.
34. MEF4: Metro ethernet network architecture framework - part 1: Generic framework. Metro
Ethernet Forum, May 2004.
35. MEF6.1: Ethernet services definitions - phase 1. Metro Ethernet Forum, June 2004.
36. MEF12: Metro ethernet network architecture framework: Part 2: Ethernet services layer. Metro
Ethernet Forum, April 2005.
37. MEF10.1: Ethernet services attributes - phase 2. Metro Ethernet Forum, November 2006.
38. A. Myers, T. S. Ng, and H. Zhang. Rethinking the service model: Scaling ethernet to a million
nodes. In Third Workshop on Hot Topics in Networks (HotNets-III), 2004.
39. I. Norros. On the use of fractional Brownian motion in the theory of connectionless networks.
IEEE Journal of Selected Areas in Communications, 13(6), August 1995.
40. Market alert: 4Q07 and global 2007 optical networking. Ovum, March 2008.
41. R. Perlman. Rbridges: Transparent routing. In IEEE Infocom, 2004.
42. H. Ren and K. Park. Towards a theory of differentiated services. In Proceedings of Quality of
Service, Eighth International Workshop, 2000.
43. E. Rosen, D. Tappan, G. Fedorkow, Y. Rekhter, D. Farinacci, T. Li, and A. Conta. MPLS Label
Stack Encoding. RFC 3032 (Proposed Standard), January 2001.
44. E. Rosen, A. Viswanathan, and R. Callon. Multiprotocol Label Switching Architecture. RFC
3031 (Proposed Standard), January 2001.
45. M. Sexton and A. Reid. Broadband Networking: ATM, SDH, and SONET. Artech House
Publishing, 1997.
46. S. Sharma, K. Gopalan, S. Nanda, and T. Chiueh. Viking: A multi-spanning-tree ethernet
architecture for metropolitan area and cluster networks. In IEEE Infocom, 2004.
47. Business ethernet services: Worldwide market update (MEF). Vertical System Group, January
2008.
48. D. Wischik and N. McKeown. Part I: Buffer sizes for core routers. ACM SIGCOMM Computer
Communication Review, 35(2), July 2005.
49. L. Yao, M. Agapie, J. Ganbar, and M. Doroslovacki. Long range dependence in internet back-
bone traffic. In IEEE International Conference on Communications, 2003.
50. R. Zhang-Shen and N. McKeown. Designing a predictable internet backbone network. In Hot-
Nets III, 2004.
Chapter 8
Overlay Networks: Applications, Coexistence
with IP Layer, and Transient Dynamics
8.1 Introduction
Despite the ease of deploying new applications in the Internet, the network layer in
the current Internet fails to provide the flexibility and stringent quality-of-service
(QoS) guarantees required by some delay and/or loss sensitive applications like
Voice-over-IP and real-time multimedia streaming. There are several reasons for
this. First, link failures are a commonplace occurrence in today’s Internet [30, 31],
and troubleshooting routing problems is extremely challenging. While intra-domain
routing protocols can take several seconds to reconverge after a failure [20], inter-
domain path restorations can take several minutes due to the slow convergence of
BGP, resulting in poor performance of real time applications. Many studies have
contributed to a better understanding of the root causes of BGP anomalies (e.g.,
failures, misconfigurations, malicious hijacking, etc.) [17, 25, 26, 29, 41, 43]. How-
ever, given the tremendous volume of BGP routing updates, it is difficult to detect
BGP anomalies in real time, let alone fix them. Secondly, applications are con-
strained by the network layer routing and peering policies of the Internet service
providers (ISPs). As shown in [36], the paths provided by ISPs based on their rout-
ing and peering policies are usually much longer than the shortest possible path.
Thirdly, strategic techniques such as multihoming [38] or hot-potato routing result
in asymmetric inter-domain paths, making it more challenging for applications to
predict their end-to-end routing performance. Fourthly, network layer support for
new services such as multicast, mobility, and efficient content distribution to a large
number of users, to name but a few, requires a large-scale infrastructure change in
all ISPs. Such a change is impractical (although not impossible) considering the cost
and effort involved to support each of the new services.
To address these issues in the Internet, application developers and service
providers have started using application-layer networks, more popularly known
as overlay networks. An overlay network typically consists of nodes located in one
or more network domains that communicate with one another at the application
layer. Application traffic can be routed from one overlay node to another solely
on the default network layer paths or by using an intermediate overlay node as a
forwarding agency, thereby forming paths that are not readily provided by the native
layer. An overlay network typically monitors multiple paths between pairs of nodes
and selects one based on its own requirements of end-to-end delay, loss rate, and/or
throughput. By doing so, it gives applications more control over routing decisions,
instead of being constrained by the network layer.
In other words, an overlay network adds an additional layer of indirection on top
of one or more physical networks. As a result, it can provide additional routing ser-
vices to applications and, in some cases, processing resources to overlay users. For
example, overlay nodes can forward data traffic onto other overlay nodes, forming
an “indirect” overlay path that can be drastically different from the default network
layer path chosen by the specific intra-domain or inter-domain protocols. Overlay
nodes typically encapsulate the application-layer data that now include information
regarding overlay routing control. Note that the network layer is oblivious to the
routing performed at the application layer. Once an overlay node receives a packet,
it decapsulates the application-layer data and decides how to handle the packet (i.e.,
forward the packet to one or more overlay nodes, spawn other processes on the
overlay node, etc.). Therefore, overlay traffic from different source–destination pairs
will share bandwidth and processing resources at the same overlay node that they
traverse. From the perspective of overlay users/flows, they are now sharing (and
competing for) the physical resources of the underlying nodes and links.
Overlay networks can be classified into two broad categories based on their
design: infrastructure-based and noninfrastructure-based overlay networks.
Infrastructure-based overlays rely on a preselected set of nodes (i.e., regular infras-
tructure) to provide overlay services, while noninfrastructure-based networks do
not have a fixed infrastructure, but depend on nodes that frequently enter and leave
the network. In the rest of this chapter, we refer to noninfrastructure-based overlays
as peer-to-peer (P2P) overlays. Some of the popular infrastructure-based overlay
networks are Detour [36], RON [12], Akamai [1], OPUS [13], PlanetLab [9], etc.
Similarly BitTorrent [2], Napster [7], Kazaa [6], Gnutella [4], Chord [40], and
Bamboo [34] are examples of popular P2P overlays.
Infrastructure-based overlay networks require an organization to own the com-
plete application-layer network and administer it. For example, consider the Akamai
network that provides a fast delivery mechanism between content providers and
their customers. This overlay network spans several different autonomous systems
(ASes) in different continents. However, the entire infrastructure of the overlay net-
work is owned and administered by Akamai. Detour [36] and RON [12] also fall
into this category and provide generic routing services to achieve reliability and
fault tolerance that are not guaranteed by the IP layer. On the other hand, P2P over-
lay networks are built dynamically based on participating end hosts. Popular file
sharing services like BitTorrent form application-layer networks using the nodes of
the users who login and use this network to exchange audio, video, and data files
among themselves. Notice that the current users can leave the network at any time
and new users can join the network frequently. Hence these overlay networks do not
have a fixed topology or connectivity, but depend on the number of users logged in,
their locations, and the files that they own.
While different overlay networks designed for a wide range of applications
may differ in their implementation details (e.g., choice of topologies or perfor-
mance goals), most of them provide the following common set of functionalities:
path/performance monitoring, failure detection, and restoration. Most overlay
routing strategies select a path between a source–destination pair with the best
performance in terms of delay, throughput, and/or packet loss. Overlay networks
monitor the actively used paths by sending frequent probes to check if the paths
adhere to acceptable performance bounds. If a problem is detected (e.g., failures or
congestion), the overlay network will select an alternate path to use.
Figure 8.1 illustrates how overlay networks operate and provide the intended
functionality to the end users. It shows two example overlays built on top of the
same physical network. Nodes A, B, C, D, E, F, G, H, and Q belong to the underlying IP network. Endnodes A1, B1, C1, and D1 belong to the first overlay network whereas A2, E2, F2, and D2 belong to the second overlay network. Note that the overlay nodes sit on top of the corresponding IP layer nodes. Let us consider the A–D source–destination pair. The default path selected by the IP layer is A–G–H–Q–D. Now suppose that link G–H is congested, resulting in poor performance. The overlay networks will detect this and reroute the traffic through another path (one that avoids link G–H) that offers better performance. For example, in the first overlay, endnode A1 can forward all the traffic to node B1 and B1 can in turn forward all the traffic to D1. Thus the overlay path taken by the traffic is now A1–B1–D1, which translates to the IP layer path A–E–B–C–F–D. Similarly, the overlay and IP layer paths for the second overlay network can be A2–E2–D2 and A–E–H–Q–D, respectively.
The overlay approach has its own advantages and disadvantages. The main benefit of overlay networks is their ease of deployment without having to add new equipment or modify existing software/protocols in the existing Internet infrastructure. Instead, the required modifications are introduced at the end hosts in the application
layer. This allows bootstrapping and incremental deployment, since not every node
needs or wants the services provided by a specific overlay network. On the other
hand, adding an additional layer of indirection does add overhead in terms of ad-
ditional packet headers and processing time. An additional layer of functionality also
introduces complexity that can be hard to manage and may lead to unintended in-
teractions with other existing layers.
We will focus our discussion on infrastructure-based overlays that aim to pro-
vide fault-resilient routing services to circumvent the performance problems of the
IP layer routing. In particular, we will explore how allowing routing control at two
independent layers, i.e., the application and the IP layers, could lead to short-term or
long-term traffic oscillations. As pointed out in our previous work [24], such unin-
tended interactions have profound implications on how ISPs design, maintain, and
run their networks. Given the increasing popularity of overlay networks, it is critical
to address issues that arise from the interaction between the two layers to ensure
healthy and synergistic coexistence. As more overlays are deployed, the volume
of traffic that they carry is increasing [39]. We show in [22] that since most over-
lay networks are designed independently with different target applications in mind,
their routing decisions may also interfere with each other. This article surveys ex-
isting work that attempts to seek a better understanding of the dynamic interactions
between overlay and IP layer, as well as across multiple overlays.
The rest of the article is organized as follows. Section 8.2 contains an overview
of various applications of overlay networks. In Section 8.3, we explore the prob-
lematic interactions that can occur between IP and overlay networks. For example,
traffic matrices become more dynamic and ambiguous, making them harder to es-
timate, and load-balancing policies at the IP layer can be undermined. We will discuss existing works that apply game theory to study such interactions both in equilibrium and during the transient period. We examine interactions across multiple overlays in Section 8.4 and determine the conditions under which they occur, model the synchronization probability, and seek solutions to avoid those problematic scenar-
ios. We summarize our findings and outline lessons learned in Section 8.5.
ISPs manage the performance of their networks in the presence of failures or con-
gestion by employing common traffic engineering (TE) techniques such as link
weight settings, load balancing, and routing policies. On the other hand, overlay
networks attempt to provide delay and loss sensitive applications with more control
in choosing end-to-end paths (hence bypassing ISP-dictated paths) to achieve bet-
ter performance in the presence of failures or high loads. The question arises as to
whether the two independent routing control layers will inadvertently interact with each other and hamper each layer in achieving its respective goals.
The interaction between the overlay networks and the IP layer was first identified
by the work of Qiu et al. [33], where the authors investigate the performance of
selfish overlay routing in Internet-like environments. The approach in this paper
was to model overlay routing and IP routing as a game theoretic problem. In this
game, overlay networks and IP network take turns in playing the game before they
reach the Nash equilibrium point (when network-level routing is static). The authors
evaluate the performance of the network only after the system reaches equilibrium.
This approach is based on two assumptions: (i) The system has a Nash equilibrium
point and it is reachable, and (ii) overlay networks and the IP network take turns
at playing the game. Also, the work ignores a wide variety of dynamics (due to
events like link/node failures, congestion, and software bugs) that occur in real-world networks. Zhang et al. [44] and Liu et al. [28] model the interaction between
overlay routing and traffic engineering (TE) as a two-player game, where the overlay
attempts to minimize its delay and the TE tries to minimize network cost. They argue
that the lack of common objective for the overlay and IP networks could result in
poor performance. In summary, selfish overlay routing can degrade the performance
of the network as a whole. Overlay routing never improves TE performance. The
average cost inflation suffered by TE depends on the fraction of overlay traffic in
the network. Studies show that the maximum cost and variation occur when half of
the network demand is overlay traffic. On the other hand, the impact on TE cost is
reduced when link capacity increases.
In [24], we examined the interaction between the two layers of control from an ISP’s
view, with emphasis on system dynamics before it reaches the equilibrium state.
Instead of static network layer routing, we are interested in what happens when
both the overlay and the IGP protocols dynamically recompute routes in response to
external triggers such as link/router failures, flash crowds, and network congestion.
We will briefly summarize the problematic interactions that can occur between IP
and overlay networks in this context (as identified in [24]).
Increased dynamism in traffic matrices. A traffic matrix (TM) specifies the
traffic demand from origin nodes to destination nodes in a network, and hence is
a critical input for many traffic engineering tasks (e.g., capacity planning and link
weight setting). Conventionally, overlay nodes encapsulate the next hop
information in the packet header and hence the IP layer is unaware of the true fi-
nal destination. Therefore, overlay routing dynamics can introduce big shifts and
duplications of TM entries in a very short timescale, making TM more dynamic
and ambiguous, and harder to estimate. This is illustrated in Figure 8.2(a) and (b), which show an overlay network spanning single and multiple domains, respectively.
Fig. 8.2 (a) Overlay network contained within an AS and (b) overlay network spanning multiple
AS domains. (Same illustration is used in [24])
In Figure 8.2(a), the TM entry for the source–destination pair A–D is 10 as a result of IP-routing decisions. However, if overlay routing intervenes and redirects traffic through an overlay path (say, A–B–D), then the TM entry for the pair A–D is duplicated as two entries of 10 units each, one for A–B and another for B–D, while the value for the entry A–D is 0. This change could happen at the time instant when the overlay network decides to change the traffic path. Given that ISPs typically estimate their TM at coarser timescales (in the order of several minutes or hours), such a sudden shift in TM could introduce errors in the ISP's estimated TM, resulting in poor network management policies. In the example shown in Figure 8.2(b), suppose that the path from AS 1 to AS 4 dictated by the IP layer is through AS 3. If the overlay network redirects traffic through AS 2, the traffic will appear to leave AS 1 via a different exit point. The TM entry in AS 1 for A–C now shifts to A–B.
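The TM duplication effect in Figure 8.2(a) is easy to express in a few lines. The sketch below uses the 10-unit A–D demand from the example; the helper name and the dictionary representation of the traffic matrix are assumptions made for illustration.

```python
# Illustrative sketch: how an overlay reroute A->B->D changes the IP-layer TM.
def reroute_via_overlay(tm, src, dst, relay, volume):
    tm[(src, dst)] = tm.get((src, dst), 0) - volume       # the direct entry drops
    tm[(src, relay)] = tm.get((src, relay), 0) + volume   # ...and is duplicated on src->relay
    tm[(relay, dst)] = tm.get((relay, dst), 0) + volume   # ...and on relay->dst
    return tm

tm = {('A', 'D'): 10}                                     # demand as seen under IP routing alone
print(reroute_via_overlay(dict(tm), 'A', 'D', relay='B', volume=10))
# -> {('A', 'D'): 0, ('A', 'B'): 10, ('B', 'D'): 10}
```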
Bypassing the ISP’s load balancing policies. A common practice for ISPs to
manage the traffic is changing the IGP link weights. ISPs make two assumptions
while using this technique: (i) traffic demands do not vary significantly over short
timescales, and (ii) changes in the path within a domain do not impact traffic
The simplicity and feasibility of designing overlay networks have attracted several service providers to adopt this approach to deliver new services. As a consequence,
several new overlay networks are getting deployed on top of the same underly-
ing IP networks in the Internet. These overlay networks are typically unaware of
the existence of other overlay networks (i.e., their node locations, routing and fail-
over strategies, optimization metrics, etc.). By allowing end hosts in these overlay
networks to make independent routing decisions at the application level, different
overlay networks may unintentionally interfere with each other. These interactions
could lead to suboptimal states in all of the involved overlay networks. In this sec-
tion we explore two kinds of such interactions. The first kind comprises interactions between overlay networks that result in suboptimal and unexpected equilibrium conditions in all of the involved overlay networks. The second kind comprises transient interactions that capture how different protocols in different overlay networks can interfere with each other, resulting in oscillations (in both route selection and network load) and cascading reactions, thus affecting the performance of the overlay network traffic.
Traditional overlay routing has been selfish in nature, i.e., most overlay networks
try to find the best path (in terms of delay, loss, and/or throughput) and route their
traffic along this best path. As we discussed earlier, such an approach will result in
performance degradation in all the overlays involved. In [21], the authors examine if
there is an optimal overlay routing strategy that can provide better performance than
the traditional selfish routing. In fact, the authors propose a routing strategy called
optimal overlay routing, where every overlay network decides to route its own traffic
along one or more paths in its network by minimizing an objective function (e.g.,
the weighted average delay along the paths). They model this as a non-cooperative
strategic game and show that there always exists a Nash equilibrium point. They
also show that such an equilibrium point is not Pareto optimal.1 Such an optimal
routing scheme will provide significant performance improvement over traditional
selfish routing, but overlays could encounter fairness issues (one overlay network
getting a bigger share of resources than the others). The work proposes two pricing
models to fix this issue, and shows that if overlay networks adopt this strategy it
could be beneficial to all the overlay networks.
In this section, we highlight how multiple overlay network protocols can interact
with each other and the corresponding dynamics during the transient period. We
begin by describing the generic overlay routing mechanism that we consider in the
rest of our discussion.
Most overlay routing strategies select a path between a source–destination pair
with the best performance based on end-to-end delay, throughput, and/or packet
loss. Similar to our previous studies [22, 23], we assume that the overlay path with
the shortest end-to-end delays will be selected (but this can be extended to include
other metrics). Overlay networks monitor the actively used paths by sending fre-
quent probes to check if the paths adhere to acceptable performance bounds. If the
probing event detects a problematic path (due to failures, congestion, etc. at the IP
¹ An outcome of a game is Pareto optimal if there is no other outcome that makes every player at least as well off and at least one player strictly better off. That is, a Pareto optimal outcome cannot be improved upon without hurting at least one player.
layer), then the overlay network sends probes at a higher rate to confirm the prob-
lem before selecting an alternate path. We assume that regular probes are sent out
every P seconds. If a probe does not receive a response within a given timeout (or
T ) value, then the path is probed at a higher rate (every Q seconds). If a path re-
mains bad after N such high-frequency probes, the overlay will find an alternate
path (or the next best path) between the source and destination nodes. For instance,
RON [12] can be modeled with P = 12s, Q = 3s, and N = 3, while the Akamai
network can be modeled with much smaller values of P , Q, and N [27]. As soon
as an alternate path is found, the traffic is moved to the alternate path, which is now
probed every P seconds to ensure that it is healthy.2
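The probing procedure above is simple enough to simulate directly. The sketch below is a simplified model, not the actual RON or Akamai implementation: path_ok(), the timeout value T, and the failure window are assumptions; P, Q, and N follow the RON-like values quoted in the text.

```python
# Simplified sketch of the generic overlay probing and path-switching model.
def time_to_switch(path_ok, P, Q, T, N, horizon=1e9):
    """Return the time at which the overlay decides to abandon the active path, or None."""
    t = 0.0
    while t < horizon:
        if path_ok(t):
            t += P                       # healthy: next regular probe in P seconds
            continue
        probe_t = t + T                  # regular probe timed out after T seconds
        bad = 0
        while bad < N and not path_ok(probe_t):
            bad += 1
            probe_t += Q                 # next high-frequency probe Q seconds later
        if bad == N:
            return probe_t               # N consecutive bad probes: move the traffic
        t = probe_t                      # path recovered; resume regular probing
    return None

# RON-like parameters (P = 12 s, Q = 3 s, N = 3); T = 1 s and the failure
# window [30 s, 80 s] are assumptions for illustration.
path_ok = lambda t: not (30.0 <= t <= 80.0)
print(time_to_switch(path_ok, P=12, Q=3, T=1, N=3, horizon=200))   # -> 46.0
```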
Using the generic overlay routing model described above, different overlay
networks with different routing strategies can be simulated. Our previous stud-
ies [22, 23] based on a realistic ISP topology have shown that transient interactions
between multiple overlays could result in two types of race conditions: traffic os-
cillations and cascading reactions. In the remainder of this section, we will briefly
summarize our findings, in particular, on how interactions lead to traffic oscillations,
an analytical model for the synchronization probability between two overlays and
insights gained through it, and the various strategies for reducing the impact of race
conditions.
Traffic oscillations refer to the network state where the traffic load between certain source–destination pairs in different overlay networks starts oscillating between two or more alternate paths. From the perspective of the underlying IP network, the loads
on some of the IP links constantly change, affecting the non-overlay traffic on these
links. This constant fluctuation of traffic load occurs at small timescales resulting in
unstable network conditions. Figure 8.3 shows one of our simulation results from [22] that illustrates traffic oscillations on some of the links in a realistic tier-1 ISP backbone topology. These oscillations continue until a stop trigger halts them. Stop triggers could be events like IGP protocol convergence, a link
failure, link recovery after a failure, etc. However, an important observation here
is that certain events that act as stop triggers for oscillations at some point in time
might not affect the oscillations at another point in time. Also, most of the events
that act as stop triggers are heavily dependent on the network conditions at the IP
layer. The order of occurrence of these stop triggers is not deterministic, thus in-
troducing unpredictability in the duration of the oscillations. In essence, the end of
oscillations depends on numerous factors, thus making it non-trivial to accurately
estimate the impact of oscillations on overlay or non-overlay traffic.
² As long as the current path adheres to the performance bounds, an overlay does not shift traffic to an alternate path even if the alternate path starts to exhibit better performance.
Fig. 8.3 Link utilization over time on several links of a tier-1 ISP backbone topology, showing traffic oscillations that persist between link recovery events [22]
Fig. 8.4 Two overlay networks that partially share primary and alternate paths [22]
Traffic oscillations are initiated when the following conditions are satisfied [22]:
Presence of External Trigger: A network event that perturbs the network state
will trigger overlay networks to search for alternate paths. This event can be a
link/node failure or sudden increase in traffic demand that leads to performance
degradation on the original path.
Sharing Primary and Backup Paths: The topologies of coexisting overlay
networks determine how the different overlay paths overlap in terms of the un-
derlying physical resources. Synchronization between two overlays occurs when
there is a pair of overlay nodes in two different overlay networks, such that
they share bottleneck link(s) in both their first and second alternate path choices.
Figure 8.4 illustrates this case with overlays on top of an IP network.
Here, we will give the intuition for the analytical model derived in [22]. As de-
scribed earlier, overlay networks probe their paths at regular intervals of P seconds.
If the path is healthy, the probe should return in one round trip time, with a measure
of the path delay (or an assessment of another chosen metric). If the probe does
not return before the timeout T expires, then the overlay starts sending its high-
frequency probes (N will be sent) every Q seconds. Thus, the probing procedure
for each overlay i on path j is specified by five parameters: the probe interval Pi ,
the high-frequency probe interval Qi , the timeout Ti , the number of high-frequency
probes Ni, and the round trip time Rij over path j. Note that Ti is the same for low- and high-frequency probes. By definition, $P_i \ge Q_i \ge T_i \ge R_{ij}$.
The probing procedure implies that (under normal circumstances) on a given path
there will be exactly one probe in every time period of length P . Now suppose that
an event (e.g., a link failure) occurs at time tl. We assume that a probe sent on path j in overlay i at time t0 “senses” the state of the path at $t_0 + R_{ij}/2$, i.e., the probe is dropped if the path is not operational at that time.3 Hence, the overlay network will detect the failure event with the probe sent at t0 if $t_0 \in [\,t_l - R_{ij}/2,\; t_l - R_{ij}/2 + P_i\,]$. We call this period the detection period. The overlay will then react at time $t_0 + T_i$, sending the high-frequency probes as discussed above.
Consider two overlay networks, O1 and O2 . Let t1 and t2 be the actual times
at which the initial probes are sent during the detection period. We assume that t1
and t2 are equally likely to occur anywhere in their detection period and hence are
uniformly distributed in their detection period. Once an overlay network detects the
failure, it begins sending the high-frequency probes every Qi time units. The final
high-frequency probe will be sent out at fi = ti + Ni Qi for i = 1, 2. There are
two cases for synchronization – in one case O1 moves its traffic first and O2 moves
shortly thereafter, or vice versa. Each case can be expressed as a constraint on the
difference f1 − f2; since the two conditions are independent of each other, we can
combine them as follows:

−T1 < f1 − f2 < T2,
−T1 < (t1 + N1 Q1) − (t2 + N2 Q2) < T2,
b < t1 − t2 < a,                                   (8.3)

where a = N2 Q2 − N1 Q1 + T2 and b = N2 Q2 − N1 Q1 − T1.
Assuming that t1 and t2 can occur anywhere in their detection period with a
uniform probability, we can represent the system as a two-dimensional graph with
the x-axis representing probe t1 and the y-axis representing probe t2 . This geomet-
ric representation allows us to compute the probability of synchronization, P(S),
of two overlays in an intuitively simple way. We define the region of conflict to be the
portion of the rectangle of possible (t1, t2) values in which synchronization will occur, i.e., the region that sat-
isfies the two constraints specified in Equation 8.3. The boundaries of the region of
3 To simplify our analysis during failures we ignore the exact values of propagation delays between
the source, the failed spot, and the destination. Thus we approximate the instant at which a probe is
dropped by Rij/2.
(Figure 8.5: geometric representation of the region of conflict, bounded by the rectangle of possible (t1, t2) values and the lines t1 − t2 = a and t1 − t2 = b; total area A = P1 P2, area of the region of conflict AC = A − A1 − A2.)
conflict are thus determined by the boundaries of the rectangle and their intersection
with the two parallel lines of slope 1 (i.e., t1 − t2 = a and t1 − t2 = b). The proba-
bility of synchronization of two overlays, P(S), can be defined to be the ratio of the
area of the region of conflict to the total area of the rectangle. We can see one spe-
cific scenario of this geometric representation in Figure 8.5. This two-dimensional
representation captures the influence of all the parameters (Pi, Qi, Ni, Ti, Ri), since
these quantities ultimately define all the corners and line intersection points needed
to compute the relevant areas. We can clearly see that the area A of the rectangle is
composed of three distinct regions: A1 (the area below Line-1: t1 − t2 = a, bounded
by the rectangle), A2 (the area above Line-2: t1 − t2 = b, bounded by the rectangle),
and the region of conflict. Hence, the region of conflict, AC, can be expressed as
AC = A − A1 − A2.
Thus we can express the probability of synchronization as

P(S) = Pr(b < t1 − t2 < a) = AC / A.                (8.4)
There are a number of ways in which the two lines can intersect the boundaries of
the rectangle [22], which are not shown here. Although this model results in nine dif-
ferent scenarios, each with a different equation for P(S), it is still attractive
since it is conceptually very simple. Our formulation indicates that the probabil-
ity of synchronization is non-negligible across a wide range of parameter settings
(P, Q, T, and N), thus implying that the ill-effects of synchronization should not
be ignored.
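This geometric calculation is straightforward to check numerically. The short Monte Carlo sketch below is our own illustration (not code from [22]); it samples the initial probe times t1 and t2 uniformly over detection periods of lengths P1 and P2, tests the combined condition b < t1 − t2 < a of Equation 8.3, and assumes equal round-trip times so that both detection periods begin at the same instant. All parameter values are arbitrary.

    import random

    def estimate_sync_probability(P1, Q1, T1, N1, P2, Q2, T2, N2, trials=200_000):
        # Monte Carlo estimate of P(S) using the condition b < t1 - t2 < a
        # from Equation 8.3 (detection periods assumed to start together).
        a = N2 * Q2 - N1 * Q1 + T2
        b = N2 * Q2 - N1 * Q1 - T1
        hits = 0
        for _ in range(trials):
            t1 = random.uniform(0.0, P1)   # initial probe of overlay 1
            t2 = random.uniform(0.0, P2)   # initial probe of overlay 2
            if b < t1 - t2 < a:
                hits += 1
        return hits / trials

    # Two identical overlays with P = 10 s, T = Q = 1 s, N = 3 high-frequency
    # probes: prints roughly 0.19, matching T(2P - T)/P^2 from the next section.
    print(estimate_sync_probability(10, 1, 1, 3, 10, 1, 1, 3))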
(Plot: theoretical P(S) versus α2, the aggressiveness factor of the second overlay network, for several values of α1 between 0.05 and 1, with R1 = R2 = 0.1 s, T1 = 4R1, T2 = 4R2, Q1 = T1, Q2 = T2.)
Fig. 8.6 Proportional parameter overlays with mixed aggressiveness and a chosen value of
RTT [22]
For illustration purposes, let us consider a scenario where two overlay networks
use identical probing parameters, and R1 = R2 = R. In this case the probability
of synchronization collapses to the simple equation P(S) = T(2P − T)/P^2. If
our model is correct, this implies that P(S) depends only on the probe interval
and the timeout value, which was confirmed through simulation [22]. The maximum
value of P(S) = 1 occurs when T = P, i.e., the overlay networks will definitely
synchronize. To decrease the probability of synchronization to less than 0.05 (i.e.,
a 5% chance of synchronization) we need to set P ≥ 40T. This observation moti-
vated us to characterize an overlay network based on its aggressiveness factor, α,
defined as the ratio of the timeout to the probe interval, αi = Ti/Pi. We consider
overlays that probe frequently and move their traffic quickly as aggressive. Note
that RTT ≤ T ≤ P, hence 0 < α ≤ 1. For two identical overlay networks we have
P(S) = 2α − α^2, which shows that as the networks increase their aggressiveness
(i.e., as α → 1), P(S) increases.
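A quick numerical check of this closed form (our own illustration; the values of P/T are arbitrary):

    # P(S) = 2*alpha - alpha**2 for two identical overlays, alpha = T/P.
    for p_over_t in (1, 2, 10, 40):
        alpha = 1.0 / p_over_t
        print(p_over_t, 2 * alpha - alpha ** 2)
    # P = T    -> P(S) = 1.0     (the overlays always synchronize)
    # P = 40T  -> P(S) ~ 0.0494  (just under the 5% target)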
Figure 8.6 [22] shows how P(S) varies as a function of the aggressiveness fac-
tors of the two overlays. Each curve in the graph represents the value of P(S) for a
fixed value of T1/P1 but different values of T2/P2. We can clearly see that as the
aggressiveness of both the overlays increases, there is a higher chance of synchro-
nization. P(S) decreases significantly when the overlays are non-aggressive. This
confirms that as long as one of the overlays is non-aggressive, the probability of syn-
chronization is low. In other words, setting a high value of P is critical to reducing
P(S). However, we wish to point out that there could be fairness issues when one
overlay is very aggressive and exploits the non-aggressive parameter settings of the
other overlay. Even in the case of one aggressive and one non-aggressive overlay
network, we found that P(S) can still be non-negligible for a wide range of relative
RTT values.
As the number of overlays in the Internet increases, the possibility of their inter-
actions also increases. Similar to the work in [21], the authors in [42] also try to
address the problem of performance degradation when multiple overlays coexist by
modeling it as a dynamic auction game. The problem that the authors address is the
following: when multiple overlay streams contend for resources at the same peer,
the performance of all the streams deteriorates. The approach that the authors
propose is to let the downstream peer play a dynamic auction game with the up-
stream peer to decide the bandwidth allocation for each of the flows.
To limit the impact of interactions highlighted earlier in this section, in particular
the problem of synchronization among multiple overlay networks, we can take two
approaches: (i) reduce the probability of synchronization among overlays, and/or
(ii) reduce the number of oscillations once the overlays get synchronized.
Intuitively, one way to make it less likely that two overlays actually get syn-
chronized is to increase the probing intervals to a large value (i.e., probe less
aggressively). But this would defeat the purpose of overlay networks, which is to react quickly
to performance degradation events. A second approach would be to add randomness
to the probing procedure. Adding randomness has been shown to help
in several cases, such as periodic routing protocols [18]; however, in the case of overlay
networks, the probability of synchronization depends on difference terms (like
N1 Q1 − N2 Q2), and randomness added to the same parameters in two overlays could
either increase or decrease the probability of synchronization [23]. Hence, adding
randomness does not always ensure that two overlays are less likely to synchronize.
In order to reduce the number of oscillations after a synchronization event, an
approach similar to the well-known behavior of TCP can be used. Whenever a flow
using TCP experiences a packet loss due to congestion the protocol backs off from
using an aggressive packet transfer rate. Typically this back-off occurs at an expo-
nential rate to reduce the impact of congestion. In the case of multiple overlays, a
similar back-off technique can be used where an overlay network successively in-
creases the reaction time each time it decides to switch routes between the same
source and destination nodes (if the reactions occur in a small time interval). This is
similar in spirit to damping, i.e., slowing down the reaction time of a protocol to avoid
responding too quickly. Note that the back-off technique is also similar to the idea
of non-aggressive probing. The main difference is that with non-aggressive
probing the parameter (or timer) values are always large, whereas with the back-
off strategy the parameter values are increased only when oscillations are detected.
A more comprehensive analysis of all the above techniques can be found in [23].
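A minimal sketch of such a back-off rule is given below. It is our own illustration of the idea, not an implementation from [23]; the timer values and the notion of a "stability window" are assumptions made for concreteness.

    class BackoffPathSwitcher:
        # Damped path switching: the reaction delay doubles each time the overlay
        # switches paths again within a short window, and resets after the path
        # has been stable for a while.  All parameter values are illustrative.

        def __init__(self, base_wait=1.0, max_wait=64.0, stable_window=300.0):
            self.base_wait = base_wait            # seconds to wait before reacting
            self.max_wait = max_wait
            self.stable_window = stable_window    # how recent is "recent"
            self.current_wait = base_wait
            self.last_switch_time = None

        def reaction_delay(self, now):
            # Called when performance degradation is detected; returns how long
            # to wait before actually switching to the alternate path.
            if (self.last_switch_time is not None
                    and now - self.last_switch_time < self.stable_window):
                # Repeated switches in a short interval suggest oscillation:
                # back off exponentially, in the spirit of TCP.
                self.current_wait = min(2 * self.current_wait, self.max_wait)
            else:
                # The path has been stable; return to the aggressive setting.
                self.current_wait = self.base_wait
            self.last_switch_time = now
            return self.current_wait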
8.5 Summary
Over the past few years, application service providers have started using the Internet
as a medium to deliver their applications to consumers. For example, Skype [10]
and Joost [5], among others, are now using the Internet to provide voice and video
services to users. There are two main reasons for this popularity: (i) the Internet is be-
coming more and more pervasive, so a much larger audience can be reached, and (ii) it
is much more economical to use shared IP networks as the transport medium
than the traditional dedicated telephone lines. With this popularity, service providers
find it necessary to provide QoS guarantees for their applications. Given the prob-
lems in IGP and BGP convergence, and the lack of stringent QoS guarantees from ISPs,
application service providers started building application-layer overlay networks
that guarantee certain performance levels. These overlay networks have several attractive
features: (i) they are very simple to design, (ii) flexible enough to be used for several differ-
ent applications, (iii) cheaper than installing an IP-layer infrastructure, (iv) easy to
deploy and manage, and (v) they give more control over application performance (rout-
ing, failure resiliency, etc.). In this chapter, we highlight how overlay networks work
along with a brief taxonomy, and describe several applications for which they are
currently being used.
The attractive features of overlay networks come with several disadvantages as
well. As described in this chapter, overlay networks could interact with the un-
derlying IP networks leading to problems both in the equilibrium state and during
transient convergence. The equilibrium state can be analyzed by modeling the in-
teractions as a non-cooperative game. Several works in the literature show that the
equilibrium is not optimal, and could affect the performance of both overlay and
IP networks. We also discussed in detail how multiple coexisting overlay networks
could interact with each other, resulting in traffic oscillations, and possible remedies
for these interactions.
Acknowledgements The research on the interactions between overlay and IP layer routing was
supported by NSF CAREER Award No. 0238348. We also thank our collaborators, Dr. Nina Taft
and Dr. Gianluca Iannaccone, at Intel Research Berkeley for their invaluable technical input.
References
1. Akamai. http://www.akamai.com.
2. Bittorrent. http://www.bittorrent.com/.
3. CNN. http://www.cnn.com.
4. Gnutella. http://www.gnutella.com.
5. Joost. http://www.joost.com/.
6. Kazaa. http://www.kazaa.com.
7. Napster. http://www.napster.com.
8. P4P Working Group. http://www.pandonetworks.com/p4p.
9. Planetlab. http://www.planet-lab.org/.
10. Skype. http://www.skype.com/.
11. Yahoo Inc. http://www.yahoo.com.
12. D. Andersen, H. Balakrishnan, M. Kaashoek, and R. Morris. Resilient Overlay Networks. In
Proceedings of ACM Symposium on Operating Systems Principles, Oct. 2001.
13. R. Braynard, D. Kostic, A. Rodriguez, J. Chase, and A. Vahdat. Opus: An Overlay Peer Utility
Service. In Proceedings of IEEE Open Architectures and Network Programming (OpenArch),
June 2002.
9 Hash-Based Techniques for High-Speed Packet Processing

9.1 Introduction
on two key application areas for high-speed routers: hash tables and related data
structures, and hash-based schemes for network measurement. While we aim to be
comprehensive, given the huge recent growth of interest in this area, this survey
should be considered a guide to the literature rather than a full account.
Before diving into the literature, we offer some of our high-level perspective that
guides this survey. First, as will become apparent in the body of this chapter, there
is an enormous amount of potential for interplay between theoretical and applied
techniques. It follows that the relevant literature spans the spectrum from very the-
oretical to very applied. We focus our attention on the middle of this range, but we
emphasize that any particular piece of work must be considered relative to its place
in the spectrum. For instance, when we discuss hash table designs for high-speed
routers, we consider several theoretical papers that focus on the design of hash ta-
bles generally, rather than on some particular application. In such work, the primary
goal is often to present and evaluate general data structure design principles that
appear to have broad potential to impact the implementation of practical systems.
The evaluation in these works usually takes the form of establishing various guar-
antees on the data structure outside the context of any particular application. For
example, a work in this vein that proposes a novel hash table design may place sig-
nificant emphasis on the sorts of theoretical and numerical guarantees that can be
obtained under the new design, with simulations serving in a mostly supporting role.
Naturally, then, a major challenge in this area is to design data structures that are
amenable to a compelling evaluation of this sort. Of course, since this approach is
very general, it typically does not speak directly to how such a data structure might
perform when appropriately adjusted and implemented in a real application. When
properly interpreted, however, the results of these more theoretical works can be
highly suggestive of increased performance for a broad range of settings.
Similarly, very applied works should be considered with respect to the concrete
results that they demonstrate for the specific application of interest. Works in the
middle of the spectrum typically should be considered with respect to some com-
bination of these goals, for instance showing that a particular theoretical intuition
seems to lead to compelling results for some class of related applications.
In the rest of the survey, we first give the necessary background and history in
Section 9.2. We then consider three fairly broad application settings: hash table
lookups for various hardware memory models (Sections 9.3 and 9.4), Bloom filter-
type applications (Section 9.5), and network measurement (Section 9.6).
9.2 Background
We review some of the key concepts underlying the hash-based data structures com-
monly proposed for high-speed packet processing. We describe the performance
measures relevant to these applications and the resulting hardware models, and also
give a brief history of the earlier literature on these applications.
We begin with a brief review of the relevant history, constructions, and issues in the
design of hash-based data structures. We describe some of the tension between the
theory and practice of hash functions that informs our analyses, review the standard
Bloom filter data structure and its variants, and discuss multiple-choice hash tables.
A family of hash functions mapping a universe U to a range V is called universal if, for
every pair of distinct items x, y ∈ U and a hash function h chosen at random from the family,

Pr(h(x) = h(y)) ≤ 1/|V|.
That is, the probability of a collision between any pair of items after being hashed
is at most what it would be for a fully random hash function, where this probability
is taken over the choice of hash function. A family of hash functions is said to be
strongly 2-universal, or more commonly in modern terminology pairwise indepen-
dent, if, for every pair of distinct items x, y ∈ U and any x', y' ∈ V, we have

Pr(h(x) = x' and h(y) = y') = 1/|V|^2.
That is, the behavior for any pair of distinct items is the same as for a fully random
hash function. Historically, in some cases, the term universal is used when strongly
universal is meant. Pairwise independence generalizes naturally to k-wise indepen-
dence for collections of k items, and similarly one can consider k-universal hash
functions, although generally k-wise independence is more common and useful.
More information can be found in standard references such as [52].
Since Carter and Wegman’s original work [10], there has been a substantial
amount of research on efficient constructions of hash functions that are theoretically
suitable for use in data structures and algorithms (e.g., [21, 54, 62] and references
therein). Unfortunately, while there are many impressive theoretical results in that
literature, the constructed hash families are usually impractical. Thus, at least at
present, these results do not seem to have much potential to directly impact a real
implementation of hash functions.
Fortunately, it seems that in practice simple hash functions perform very well.
Indeed, they can be implemented very efficiently. For example, Dietzfelbinger
et al. [20] exhibit a hash function that can be implemented with a single multi-
plication and a right shift operation and is almost universal. For scenarios where
multiplications are undesirable, Carter and Wegman’s original work [10] provides a
universal hash function that relies on XOR operations. Some practical evaluations
of these hash functions and others, for both hardware and software applications
(including Bloom filters, discussed in Section 9.2.1.2), are given in [26, 59–61, 67].
Overall, these works suggest that it is possible to choose very simple hash functions
that work very well in practical applications.
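As a concrete illustration of how simple such functions can be, the sketch below implements a multiply-shift hash in the spirit of Dietzfelbinger et al. [20]: a random odd multiplier, one multiplication, and one right shift map w-bit keys to M-bit bucket indices. The specific widths chosen here are our own assumptions.

    import random

    W = 64                          # key width in bits (an assumed value)
    M = 16                          # the table has 2**M buckets (an assumed value)
    a = random.getrandbits(W) | 1   # random odd W-bit multiplier

    def multiply_shift_hash(x):
        # (a * x mod 2**W) >> (W - M): one multiply and one shift.
        return ((a * x) & ((1 << W) - 1)) >> (W - M)

    # Example: hash a few keys into a table of 2**16 buckets.
    print([multiply_shift_hash(x) for x in (12345, 67890, 2**40 + 7)])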
There is also theoretical work that strives to explain why simple hash functions
seem to perform well in practice. One common approach is to examine a partic-
ular theoretical analysis that uses the assumption of fully random hash functions,
and then attempt to modify the analysis to obtain a comparable result for a class of
simple hash functions (e.g., universal hash functions), or a particular family of hash
functions. For instance, partitioned Bloom filters (described in Section 9.2.1.2) can
be implemented with any universal hash function, albeit with a small increase in
the false positive probability. As an example of this technique that works only for
a specific hash family, Woelfel [76] shows that one can implement d -left hashing
(described in Section 9.2.1.3) using a particular type of simple hash function. In a
different direction, Mitzenmacher and Vadhan [53] show that for certain applica-
tions, if one is willing to assume that the set of items being hashed satisfies certain
randomness properties, then any analysis based on the assumption that the hash
functions are fully random is also valid with universal hash functions (up to some
small, additional error probability). From a practical perspective, this work shows
that it may be possible to construct some sort of statistical test that would provide a
theoretical explanation for how well applications built on simple hash functions will
work on a particular source of real data. Alternatively, if one is willing to assume that
the set of items being hashed has a certain amount of entropy, then one can expect
the same performance as derived from an analysis with fully random hash functions.
Having reviewed the approaches to hashing most related to this work, we now ar-
ticulate our perspective on hash functions. This is essentially just the standard view
in the networking literature, but it bears repeating. Since we are primarily concerned
with real-world systems, and since it is usually possible to choose a simple, practi-
cal hash function for an application that results in performance similar to what we
would expect for a fully random hash function, we allow ourselves to assume that
our hash functions are fully random in our theoretical analyses. Thus, we take the
perspective of modeling the hash functions for the sake of predicting performance
in a statistical sense, as opposed to explicitly constructing the hash functions to sat-
isfy concrete theoretical guarantees. Furthermore, since we assume that simple hash
functions work well, we generally do not think of the cost of hashing as a bottleneck,
and so we often allow ourselves to use hash functions liberally.
A Bloom filter [2] is a simple space-efficient randomized data structure for repre-
senting a set in order to support membership queries. We begin by reviewing the
fundamentals, based on the presentation of the survey [8], which we refer to for fur-
ther details. A Bloom filter for representing a set S = {x1, x2, ..., xn} of n items
from a large universe U consists of an array of m bits, initially all set to 0. The filter
uses k independent (fully random) hash functions h1, ..., hk with range {1, ..., m}.
For each item x ∈ S, the bits hi(x) are set to 1 for 1 ≤ i ≤ k. (A location can be
set to 1 multiple times.) To check if an item y is in S, we check whether all hi(y)
are set to 1. If not, then clearly y is not a member of S. If all hi(y) are set to 1, we
assume that y is in S, and hence a Bloom filter may yield a false positive.
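A minimal Bloom filter along these lines is sketched below (our own illustration; the salted SHA-1 digests merely stand in for the k fully random hash functions, and the parameters in the usage example are arbitrary).

    import hashlib

    class BloomFilter:
        # Standard Bloom filter: an m-bit array and k hash functions.

        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = bytearray(m)     # one byte per bit, for simplicity

        def _positions(self, item):
            # Derive k hash values in {0, ..., m-1} from salted digests.
            for i in range(self.k):
                digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = 1

        def __contains__(self, item):
            # May return a false positive, never a false negative.
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter(m=8 * 1024, k=6)
    bf.add("10.0.0.1")
    print("10.0.0.1" in bf, "192.168.1.1" in bf)   # True, (almost surely) False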
The probability of a false positive for an item not in the set, or the false positive
probability, can be estimated in a straightforward fashion, given our assumption
that the hash functions are fully random. After all the items of S are hashed into the
Bloom filter, the probability that a specific bit is still 0 is

p' = (1 − 1/m)^kn ≈ e^(−kn/m).
and so, asymptotically, the performance is the same as the original scheme. In prac-
tice, however, the partitioned Bloom filter tends to perform slightly worse than the
non-partitioned Bloom filter. This is explained by the observation that

(1 − 1/m)^kn > (1 − k/m)^n

when k > 1, so partitioned filters tend to have more 1's than non-partitioned filters,
resulting in larger false positive probabilities.
We also point out that, in some cases, memory considerations may make alter-
native approaches for setting the bits of a Bloom filter more attractive. If one must
bring in a page or a cache line to examine the bits in a Bloom filter, then examining
k random bits may be too expensive. One can instead associate with an item k ran-
dom bits from a page or a smaller number of cache lines. This idea originated with
work of Manber and Wu [49], but variations are commonly explored (e.g., [58]).
The standard Bloom filter naturally supports insertion operations: to add a new
item x to the set represented by the filter, we simply set the corresponding bits of
the filter to 1. Unfortunately, the data structure does not support deletions, since
changing bits of the filter from 1 to 0 could introduce false negatives. If we wish to
support deletions, we can simply replace each bit of the filter with a counter, initially
set to 0. To insert an item x into the filter, we now increment its corresponding
counters h1(x), ..., hk(x), and to delete an item known to be in the set represented
by the filter, we decrement those counters. To test whether an item y is in S, we
can simply check whether all the counters h1(y), ..., hk(y) are positive, obtaining
a false positive if y ∉ S but none of the counters are 0.
This Bloom filter variant is called a counting Bloom filter [26]. Clearly, all of our
prior analysis for standard Bloom filters applies to counting Bloom filters. How-
ever, there is a complication in choosing the number of bits to use in representing
a counter. Indeed, if a counter overflows at some point, then the filter may yield a
false negative in the future. It is easy to see that the number of times a particular
counter is incremented has distribution Binomial(nk, 1/m) ≈ Poisson(nk/m) =
Poisson(ln 2), by the Poisson approximation to the binomial distribution (assuming
k = (m/n) ln 2 as above). By a union bound, the probability that some counter over-
flows if we use b-bit counters is at most m Pr(Poisson(ln 2) ≥ 2^b). As an example,
for a sample configuration with n = 10,000, m = 80,000, k = (m/n) ln 2 = 8 ln 2,
and b = 4, we have f = (1/2)^k = 2.14% and m Pr(Poisson(ln 2) ≥ 2^b) ≈
1.78 × 10^−11, which is negligible. (In practice k must be an integer, but the point is
clear.) This sort of calculation is typical for counting Bloom filters.
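The arithmetic in this example is easy to reproduce; the small script below (our own check, using the same configuration) evaluates the Poisson tail directly. The exact tail value depends on whether k is rounded to an integer, but it is negligible either way.

    import math

    n, m, b = 10_000, 80_000, 4
    k = (m / n) * math.log(2)            # number of hash functions (not rounded)
    lam = n * k / m                      # = ln 2, expected increments per counter

    def poisson_tail(lam, threshold, terms=60):
        # Pr(Poisson(lam) >= threshold), summed term by term from the tail
        # (avoids the cancellation of computing 1 - CDF for tiny tails).
        term = math.exp(-lam) * lam ** threshold / math.factorial(threshold)
        total = 0.0
        for i in range(threshold, threshold + terms):
            total += term
            term *= lam / (i + 1)
        return total

    false_positive = 0.5 ** k                       # about 0.021
    overflow_bound = m * poisson_tail(lam, 2 ** b)  # about 1e-11 or smaller
    print(false_positive, overflow_bound)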
One could also use counting Bloom filters to represent multisets. Again, when
a copy of an element x is inserted, we increment its corresponding counters
h1(x), ..., hk(x), and to delete a copy of an item known to be in the set represented
by the filter, we decrement those counters. We can test whether an item y occurs in
S with multiplicity at least ℓ ≥ 1 by testing whether the counters h1(y), ..., hk(y)
are at least ℓ, with some probability of a false positive.
We now describe a variant of counting Bloom filters that is particularly useful
for high-speed data stream applications. The data structure is alternately called a
parallel multistage filter [24] or a count-min sketch [13] (the paper [24] applies
the data structure to network measurement and accounting, while Cormode and
Muthukrishnan [13] show how it can be used to solve a number of theoretical
problems in calculating statistics for data streams). The input is a stream of up-
dates (it, ct), starting from t = 1, where each item it is a member of a universe
U = {1, ..., n}, and each count ct is an integer. The state of the system at time T is
given by a vector a(T) = (a1(T), ..., an(T)), where aj(T) is the sum of all ct for
which t ≤ T and it = j. The input is typically guaranteed to satisfy the condition
that aj(T) > 0 for every j and T. We generally drop the T when the meaning is
clear.
The structures consist of a two-dimensional array Count of counters with width
w and depth d: Count[1, 1], ..., Count[d, w]. Every entry of the array is initialized
to 0. In addition, there are d independent hash functions h1, ..., hd : {1, ..., n} →
{1, ..., w}. (Actually, it is enough to assume that the hash functions are universal,
as shown in [13]; the argument below also holds with this assumption.) To process
an update (i, c), we add c to the counters Count[1, h1(i)], ..., Count[d, hd(i)]. Fur-
thermore, we think of âi = min over j ∈ {1, ..., d} of Count[j, hj(i)] as being an estimate of ai.
Indeed, it is easy to see that âi ≥ ai (using the assumption that aj > 0 for every j).
We now derive a probabilistic upper bound on âi. For j ∈ {1, ..., d}, let

Xi,j = Σ_{i' ≠ i : hj(i') = hj(i)} ai'.
Since the hash functions are fully random, E[Xi,j] ≤ ||a||/w, where ||a|| = Σk ak
(the L1 norm of a, assuming all of the entries in a are non-negative). Markov's
inequality [52] then implies that for any threshold value ε > 0, we have Pr(Xi,j ≥
ε) ≤ ||a||/(εw). Now we note that âi = ai + min over j ∈ {1, ..., d} of Xi,j and use independence
of the hj's to conclude that

Pr(âi ≥ ai + ε) ≤ (||a||/(εw))^d.
increasingly important for many applications (see, for instance, the survey [8]). As
just a partial listing of additional examples of the proliferation of Bloom filter vari-
ations, compressed Bloom filters are optimized to minimize space when transmitted
[50], retouched Bloom filters trade off false positives and false negatives [22], and
approximate concurrent state machines extend the concept of a Bloomier filter by
tracking the dynamically changing state of a changing set of items [3]. Although
recently more complex but asymptotically better alternatives have been proposed
(e.g., [4, 55]), the Bloom filter’s simplicity, ease of use, and excellent performance
make it a standard data structure that is, and will continue to be, of great use in many
applications.
The canonical example of a hash table is one that uses a single hash function (which
is assumed to be fully random) with chaining to deal with collisions. The standard
analysis shows that if the number of buckets in the table is proportional to the num-
ber of items inserted, the expected number of items that collide with a particular
item is constant. Thus, on average, lookup operations should be fast. However, for
applications where such average case guarantees are not sufficient, we also need
some sort of probabilistic worst-case guarantee. Here, the qualifier that our worst-
case guarantees be probabilistic excludes, for instance, the case where all items in
the table are hashed to the same bucket. Such situations, while technically possible,
are so ridiculously unlikely that they do not warrant serious consideration (at least
from a theoretical perspective). As an example of a probabilistic worst-case guar-
antee, we consider throwing n balls independently and uniformly at random into n
bins. In this case, a classical result (e.g., [52, Lemmas 5.1 and 5.12] or the origi-
nal reference by Gonnet [32]) shows that the maximum number of balls in a bin is
Θ((log n)/log log n) with high probability. This result translates directly to a prob-
abilistic worst-case guarantee for a standard hash table with n items and n buckets:
while the expected time to look up a particular item is constant, with high proba-
bility the longest time that any lookup can require is Θ((log n)/log log n). Similar
results hold for other hashing variations, such as linear probing, where each item
is hashed to a bucket, each bucket can hold one item, and if a bucket already con-
tains an item, successive buckets are searched one at a time until an empty bucket
is found [39]. A standard result is that when αn items are placed into n buckets, the
expected time to look up a particular item is constant, but with high probability the
longest time that any lookup can require is Θ(log n). It is worth pointing out that,
for certain memory layouts, where several buckets can fit on a single cache line, the
locality offered by linear probing may offer performance advantages beyond what
is suggested by these asymptotics.
The chained hashing example illustrates a connection between hashing and bal-
anced allocations, where some number of balls is placed into bins according to some
probabilistic procedure, with the implicit goal of achieving an allocation where
the balls are more-or-less evenly distributed among the bins. In a seminal work,
Azar et al. [1] strengthened this connection by showing a very powerful balanced
allocation result: if n balls are placed sequentially into m ≥ n bins for m = O(n),
with each ball being placed in one of a constant d ≥ 2 randomly chosen bins with
minimal load at the time of its insertion, then with high probability the maximal load
in a bin after all balls are inserted is (ln ln n)/ln d + O(1). In particular, if we modify
the standard hash table with chaining from above to use d hash functions, inserting
an item into one of its d hash buckets with minimal total load, and performing a
lookup for an item by checking all d of its hash buckets, then the expected lookup
time is still constant (although larger than before), but the probabilistic worst-case
lookup time drops exponentially. This scheme, usually called d -way chaining, is
arguably the simplest instance of a multiple choice hash table, where each item is
placed according to one of several hash functions.
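The effect is easy to see in a small simulation (our own, with arbitrary parameters): placing each ball in the least loaded of d random bins collapses the maximum load from the single-choice Θ((log n)/log log n) to roughly (ln ln n)/ln d.

    import random

    def max_load(n, m, d):
        # Throw n balls into m bins; each ball goes to the least loaded of
        # d uniformly chosen bins.  Return the maximum bin load.
        load = [0] * m
        for _ in range(n):
            choices = [random.randrange(m) for _ in range(d)]
            best = min(choices, key=lambda bin_: load[bin_])
            load[best] += 1
        return max(load)

    n = m = 100_000
    print("1 choice :", max_load(n, m, 1))   # typically around 8
    print("2 choices:", max_load(n, m, 2))   # typically 3 or 4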
Unsurprisingly, the impact of [1] on the design of randomized algorithms and
data structures, particularly hash tables and their relatives, has been enormous. For
details and a more complete list of references, we refer to the survey [51]. Before
moving on, however, we mention an important improvement of the main results
in [1] due to Vöcking [73]. That work exhibits the d -left hashing scheme, which
works as follows. There are n items and m buckets. The buckets are partitioned into
d groups of approximately equal size, and the groups are laid out from left to right.
There is one hash function for each group, mapping the items to a randomly chosen
bucket in the group. The items are inserted sequentially into the table, with an item
being inserted into the least loaded of its d hash buckets (using chaining), with ties
broken to the left. Vöcking [73] shows that if m = n and d ≥ 2 is constant, then the
maximum load of a bucket after all the items are inserted is (ln ln n)/(d ln φd) + O(1),
where φd is the asymptotic growth rate of the dth order Fibonacci numbers. (For
example, when d = 2, φd is the golden ratio 1.618....) In particular, this improves
the factor of ln d in the denominator of the (ln ln n)/ln d + O(1) result of Azar
et al. [1]. Furthermore, Vöcking [73] shows that d -left hashing is optimal up to an
additive constant. Interestingly, both the partitioning and the tie-breaking together
are needed to obtain this improvement.
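The following sketch shows the two ingredients concretely: the table is split into d equal groups, each with its own hash function, and ties in load are broken toward the leftmost group. It is our own illustration (with chaining represented by per-bucket lists and per-group salts standing in for independent hash functions), not code from [73].

    import random

    class DLeftHashTable:
        # d-left hashing: d equal groups of buckets, ties broken leftward.

        def __init__(self, buckets_per_group, d=2):
            self.d = d
            self.size = buckets_per_group
            self.groups = [[[] for _ in range(buckets_per_group)]
                           for _ in range(d)]
            self.salts = [random.getrandbits(64) for _ in range(d)]

        def _bucket(self, g, item):
            return hash((self.salts[g], item)) % self.size

        def insert(self, item):
            candidates = [self.groups[g][self._bucket(g, item)]
                          for g in range(self.d)]
            # min() keeps the first (i.e., leftmost) bucket of minimal load.
            min(candidates, key=len).append(item)

        def lookup(self, item):
            return any(item in self.groups[g][self._bucket(g, item)]
                       for g in range(self.d))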
Both d -way chaining and d -left hashing are practical schemes, with d -left hash-
ing being generally preferable. In particular, the partitioning of the hash buckets into
groups for d -left hashing makes that scheme more amenable to a hardware imple-
mentation, since it allows for an item’s d hash locations to be examined in parallel.
For high-speed packet-processing applications, however, hashing schemes that re-
solve collisions with chaining are often undesirable. Indeed, for these applications,
it is often critical that almost everything be implemented cleanly in hardware, and
in this case the dynamic memory allocation requirements of hashing schemes that
use chaining are problematic. Thus, we prefer open-addressed hash tables where
each bucket can store a fixed constant number of items (typically determined by the
number of items that can be conveniently read in parallel). Of course, we can simu-
late a hashing scheme that uses chaining with an open-addressed hash table as long
as no bucket overflows, and then we just need to ensure that it is highly unlikely
for a bucket to overflow. Alternatively, we can work directly with open-addressed
hashing schemes that are explicitly designed with a limit on the number of items
that can be stored in a bucket. In this case, for the sake of simplicity, we typically
assume that each bucket can hold at most one item. The results can usually be gen-
eralized for larger buckets in a straightforward way. The potential expense of using
open-addressed hash tables in these ways is that many buckets may be far from full,
wasting significant space.
The standard open-addressed multiple choice hash table is the multilevel hash
table (MHT) of Broder and Karlin [7]. This is a hash table consisting of d sub-
tables T1, ..., Td, with each Ti having one hash function hi. We view these tables
as being laid out from left to right. To insert an item x, we find the minimal i such
that Ti[hi(x)] is unoccupied, and place x there. As above, we assume that each
bucket can store at most one item; in this case the MHT is essentially the same as a
d-left hash table with the restriction that each bucket can hold at most one item, but
the correspondence disappears for larger bucket sizes. If T1[h1(x)], ..., Td[hd(x)]
are all occupied, then we declare a crisis. There are multiple things that we can do
to handle a crisis. The approach in [7] is to resample the hash functions and rebuild
the entire table. That work shows that it is possible to insert n items into a properly
designed MHT with O(n) total space and d = log log n + O(1) in O(n) expected
time, assuming only 4-wise independent hash functions.
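A bare-bones MHT sketch follows (our own illustration: the geometrically decreasing sub-table sizes follow the intuition discussed next, salts stand in for the hash functions, and a failed insertion simply reports a crisis rather than rebuilding as in [7]).

    import random

    class MultilevelHashTable:
        # MHT: d sub-tables scanned left to right, one item per bucket.

        def __init__(self, first_size, d):
            # Sub-table sizes decrease geometrically (factor 2 here).
            self.sizes = [max(1, first_size // (2 ** i)) for i in range(d)]
            self.tables = [[None] * s for s in self.sizes]
            self.salts = [random.getrandbits(64) for _ in range(d)]

        def _h(self, i, item):
            return hash((self.salts[i], item)) % self.sizes[i]

        def insert(self, item):
            for i in range(len(self.tables)):
                slot = self._h(i, item)
                if self.tables[i][slot] is None:
                    self.tables[i][slot] = item
                    return True
            return False            # all d buckets occupied: a crisis

        def lookup(self, item):
            return any(self.tables[i][self._h(i, item)] == item
                       for i in range(len(self.tables)))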
In Section 9.3, we discuss more recent work building on [7] that describes ways
to design MHTs so that no rehashings are necessary in practice. Essentially, follow-
ing [7], the idea is that if the Ti's are (roughly) geometrically decreasing in size,
then the total space of the table is O(n). If the ratio by which the size of Ti+1
is smaller than Ti is, say, twice as large as the expected fraction of items that are
not stored in T1, ..., Ti, then the distribution of items over the Ti's decreases dou-
bly exponentially with high probability. This double exponential decay allows the
choice of d = log log n + O(1). For a more detailed description of this intuition,
see [7] or [36].
We defer the details of the various ways to construct MHTs to Sections 9.3 and
9.4, where MHTs play a critical role. For the moment, however, we simply note
that MHTs naturally support deletions, as one can just perform a lookup on an item
to find its location in the table, and then mark the corresponding item as deleted.
Also, MHTs appear well suited to a hardware implementation. In particular, their
open-addressed nature seems to make them preferable to approaches that involve
chaining, and their use of separate sub-tables allows for the possibility that all of
the hash locations for a particular item can be accessed in parallel. Indeed, these
considerations are part of the original motivation from [7].
There is also a substantial amount of work in the theory literature on open-
addressed multiple choice hashing schemes that allow items in the table to be moved
during an insertion in order to increase space utilization [15, 21, 30, 38, 56, 57].
The most basic of these schemes is cuckoo hashing [30, 56], which works as
follows. There are d sub-tables T1, ..., Td, with each Ti having one hash func-
tion hi. When attempting to insert an item x, we check if any of its hash locations
T1[h1(x)], ..., Td[hd(x)] are unoccupied, and place it in an unoccupied bucket if
that is the case. Otherwise, we choose a random I ∈ {1, ..., d} and evict the item
y in TI[hI(x)], replacing y with x. We then check if any of y's hash locations
are unoccupied, placing it in the leftmost unoccupied bucket if this is the case.
Otherwise, we choose a random J ∈ {1, ..., d} \ {I} and evict the item z in
TJ[hJ(y)], replacing it with y. We repeat this procedure until an eviction is no
longer necessary.
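A sketch of this insertion loop appears below (our own illustration; the cap on the number of evictions is an added safeguard, standing in for whatever rehashing policy a real implementation would use).

    import random

    class CuckooHashTable:
        # d-ary cuckoo hashing with one item per bucket.

        MAX_KICKS = 500   # illustrative safeguard against endless eviction chains

        def __init__(self, buckets_per_table, d=4):
            self.d = d
            self.size = buckets_per_table
            self.tables = [[None] * buckets_per_table for _ in range(d)]
            self.salts = [random.getrandbits(64) for _ in range(d)]

        def _h(self, i, item):
            return hash((self.salts[i], item)) % self.size

        def insert(self, item):
            last = None              # sub-table the current item was evicted from
            for _ in range(self.MAX_KICKS):
                # Place in the leftmost unoccupied hash location, if any.
                for i in range(self.d):
                    if i != last and self.tables[i][self._h(i, item)] is None:
                        self.tables[i][self._h(i, item)] = item
                        return True
                # Otherwise evict a random occupant and continue with it.
                i = random.choice([j for j in range(self.d) if j != last])
                slot = self._h(i, item)
                item, self.tables[i][slot] = self.tables[i][slot], item
                last = i
            return False             # insertion failed; a rebuild would be needed

        def lookup(self, item):
            return any(self.tables[i][self._h(i, item)] == item
                       for i in range(self.d))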
Cuckoo hashing allows for a substantial increase in space utilization over a stan-
dard MHT with excellent amortized insertion times (even for small d, say, d = 4).
Unfortunately, however, in practice a standard cuckoo hash table occasionally ex-
periences insertion operations that take significantly more time than the average.
This issue is problematic for high-speed packet-processing applications that have
very strict worst-case performance requirements. We address this issue further in
Section 9.4.
Roughly speaking, the quality of the algorithms and data structures in this chapter
can be measured by their space requirements and speed for various operations. In
conventional algorithm analysis, speed is measured in terms of processing steps
(e.g., instructions). However, as a first approximation, we count only memory ac-
cesses as processing steps. In a hardware design, this approximation is usually
justified by the ability to perform multiple complex processing steps in a single
cycle in hardware (using combinatorial logic gates, which are plentiful). In a soft-
ware design, we can often ignore processing steps because instruction cycle times
are very fast compared to memory access times.
Thus our main application performance measures are usually the amount of
memory required and the number of memory accesses required for an operation
of interest. Unfortunately, these measures are more complicated than they may first
appear because there are different types of memories: fast memory (cache in soft-
ware, Static Random Access Memory (SRAM) in hardware), and slow memory
(main memory in software, Dynamic Random Access Memory (DRAM) in hard-
ware). The main space measure is typically the amount of fast memory required for
a technique. If a design only uses slow memory, then the amount of memory used
is often irrelevant because such memory is typically cheap and plentiful. Similarly,
if a design uses both fast and slow memory, the main speed measure is typically
the number of slow memory accesses (because fast memory accesses are negligible
in comparison). If a design uses only fast memory, then the speed measure is the
number of fast memory accesses.
To make this abstraction more concrete, we give a brief description and com-
parison of SRAM and DRAM. Typical SRAM access times are 1–2 ns for on-chip
SRAM and 5–10 ns for off-chip SRAM; it is possible to obtain on-chip SRAMs
with 0.5 ns access times. On-chip SRAM is limited to around 64 Mbits today. The
level 1 and level 2 caches in modern processors are built from SRAM.
To hold its value, an SRAM bit cell requires at least five transistors. By
comparison, a DRAM cell uses only a single transistor connected to an output ca-
pacitance that can be manufactured to take much less space than the transistors in
an SRAM. Thus SRAM is less dense and more expensive (per bit) than memory
technology based on DRAM. However, the compact design of a DRAM cell has
an important negative consequence: a DRAM cell requires higher latency to read
or write than the SRAM cell. The fastest off-chip DRAMs take around 40–60 ns
to access (latency) with longer times such as 100 ns between successive reads
(throughput). It seems clear that DRAM will always be denser but slower than
SRAM.
Moving on, three major design techniques are commonly used in memory sub-
system designs for networking chips and can be used for the algorithms and data
structures in this chapter. While we will not dwell on such low-level issues, it is
important to be aware of these techniques (and their limitations).
Memory Interleaving and Pipelining Many data structures can be split into
separate banks of an interleaved memory, where each bank is a DRAM. Mem-
ory accesses can then be interleaved and pipelined to facilitate parallelism. For
example, consider a binary search tree data structure, and suppose that it is split
into separate banks based on nodes’ locations in the tree. If we wish to perform
multiple lookups on the tree, we can boost performance by allowing operations
at different nodes of the tree to occur in parallel.
Wide Word Parallelism A common theme in many networking designs is to
use wide memory words that can be processed in parallel. This can be imple-
mented using DRAM and exploiting the page mode, or by using SRAM and
making each memory word wider. In software designs, wide words can be ex-
ploited by aligning data structures to the cache line size. In hardware designs,
one can choose the width of memory to fit the problem (up to say 5,000 bits or
so, after which electrical issues may become problematic).
Combining DRAM and SRAM Given that SRAM is expensive and fast, and
DRAM is cheap and slow, it makes sense to combine the two technologies to
attempt to obtain the best of both worlds. The simplest approach is to simply use
some SRAM as a cache for a DRAM database. While this technique is classical,
there are many more creative applications of the use of non-uniform memory
models; we will see some in this chapter.
We must also point out that routers often make use of content-addressable mem-
ories (CAMs), which are fully associative memories, usually based on SRAM
technology (so that improvements in SRAM technology tend to translate into im-
provements in CAM technology), that support a lookup of a data item in a single
access by performing lookups on all memory locations in parallel. There are also
ternary CAMs (TCAMs) that support wildcard bits, which is an extremely use-
ful feature for prefix match lookups, a critical router application discussed in
Section 9.2.3. This specialty hardware is much faster than any data structure built
with commodity SRAM, such as a hash table, could ever be. However, the parallel
lookup feature of CAMs causes them to use a lot more power than comparable
SRAMs. Furthermore, the smaller market for this sort of technology results in
CAMs being much more expensive, per bit, than SRAMs. For both peak power
consumption and cost per bit, an order of magnitude difference between a CAM
and a comparable SRAM would not be surprising.
In this chapter, we regard CAMs as being expensive, special-purpose hardware
for table lookups that are practical only when storing small sets of items. Thus,
we do not think of CAMs as being a replacement for hash tables, but we do advo-
cate their use for parts of our hash-based data structures. In particular, in Section
9.4, we describe MHTs from which some small number of items are expected to
overflow. There, a CAM is used to store those items (but not all items, which is
prohibitively expensive), allowing excellent worst-case lookup times for the entire
data structure.
We start by examining a problem first addressed by Song et al. [65]. Recall that in
a multiple choice hashing scheme with d hash functions, one performs a lookup
for an item x by examining (in the worst case) all d hash locations corresponding
to x. In a hardware application, it may be reasonable to implement this procedure
by examining all d locations in parallel, particularly if the hash table memory is
stored on the chip that is performing the lookup. However, if the hash table must be
stored off-chip (due to its size), then performing all d of these lookups in parallel
may introduce a prohibitively expensive cost in terms of chip I/O, particularly the
number of pins on the chip that are needed to access the hash table. It then becomes
natural to ask whether we can design some sort of on-chip summary that can reduce
the number of worst-case (off-chip) memory accesses to the hash table from d to,
say, 1. This is the question addressed in [65].
More formally, the summary answers questions of the following form, “Is item x
in the hash table, and if so, in which of its d hash locations is it?” The summary is
allowed some small false positive probability (e.g., 0.5%) for items not in the hash
table, since these are easily detected and so do not significantly impact performance
as long as they are infrequent. However, if a queried item x is in the hash table,
the summary should always correctly identify the hash location used to store x,
unless some sort of unlikely failure condition occurs during the construction of the
summary. (In particular, if the summary is successfully constructed, then it does not
generate false negatives and the worst-case number of off-chip memory accesses
is 1.) The objective is now to design a summary data structure and the corresponding
hash table so that they are efficient and the summary data structure is successfully
constructed with overwhelming probability.
The basic scheme proposed by Song et al. [65] is as follows. (Essentially, it is
a combination of a counting Bloom filter variant for the summary with a variant
of d -way chaining for the hash table.) For simplicity, we start by describing the
variant where the hash table is built by inserting n items, and after that it is never
modified. Here, the hash table consists of m buckets and d hash functions, and the
summary consists of one b-bit counter for each bucket. When an item is inserted into
the hash table, it is placed in all of its d hash buckets and all of the corresponding
counters are incremented. Then the hash table is pruned; for each item in the table,
the copy in the bucket whose corresponding counter is minimal (with ties broken
according to the ordering of the buckets) is kept, and the rest are deleted. A query to
the summary for an item x is now answered by finding the smallest of its d counters
(again with ties broken according to the ordering of the buckets). If the value of this
counter is 0, then x cannot be in the table, and otherwise x is presumed to be in the
corresponding bucket.
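The sketch below makes this build-then-prune variant concrete (our own illustration: salted digests stand in for the d hash functions, ties are broken by the order of the hash functions, and the heuristic refinements of [65] are omitted).

    import hashlib

    def _positions(x, m, d):
        # d illustrative hash functions derived from salted digests.
        return [int(hashlib.sha1(f"{i}:{x}".encode()).hexdigest(), 16) % m
                for i in range(d)]

    def build_pruned_table(items, m, d):
        # Build: every item goes into all d of its buckets and increments the
        # corresponding counters.  Prune: keep only the copy in the bucket whose
        # counter is minimal.
        counters = [0] * m              # b-bit counters in a hardware realization
        buckets = [[] for _ in range(m)]
        for x in items:
            for pos in _positions(x, m, d):
                counters[pos] += 1
                buckets[pos].append(x)
        for x in items:
            pos_list = _positions(x, m, d)
            keep = min(pos_list, key=lambda p: counters[p])
            for pos in pos_list:
                if pos != keep:
                    buckets[pos].remove(x)
        return counters, buckets

    def summary_query(counters, x, m, d):
        # On-chip query: the bucket with the smallest counter, or None if absent.
        pos_list = _positions(x, m, d)
        best = min(pos_list, key=lambda p: counters[p])
        return None if counters[best] == 0 else best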
Song et al. [65] give several heuristic improvements to this basic scheme in order
to reduce collisions in the underlying hash table and optimize performance for vari-
ous applications. The heuristics appear effective, but they are only analyzed through
simulations, and do not appear to be amenable to theoretical or numerical analy-
ses. Insertions can be handled, but can necessitate moving items in the hash table.
Deletions are significantly more challenging in this setting than in most hash ta-
ble constructions. In particular, it may be necessary to keep a copy of the entire
hash table (or a smaller variant called a shared-node fast hash table) before pruning;
see [65] for details.
Kirsch and Mitzenmacher [36] build on [65], offering alternative approaches
more amenable to analysis. They propose the use of an MHT as the underlying
hash table. This gives a worst case bound on the number of items in a bucket, al-
though it introduces the possibility of a crisis, which must be very unlikely in order
for the MHT to give good performance. Furthermore, since the distribution of items
over the sub-tables of an MHT decays doubly exponentially with high probability,
it suffices to design summaries that perform well under the assumption that most of
the items in the MHT are in the first sub-table, most of the rest are in the second, etc.
Kirsch and Mitzenmacher [36] propose two summary design techniques based on
this observation, both based on Bloom filter techniques. We review the one that is
easier to describe and motivate here. As before, for simplicity we start by assuming
that we are only interested in building a hash table and corresponding summary for n
items by inserting the n items sequentially, and then the data structures are fixed for
all time. If the MHT consists of d sub-tables T1, ..., Td, then the summary consists
of d Bloom filters F1, ..., Fd. Each filter Fi represents the set of all items stored
in Ti, ..., Td. To perform a query for an item x, we first check whether F1 yields a
positive for x; if not, then x cannot be in the MHT. Otherwise, we find the largest i
where Fi returns a positive for x, and declare that x is in Ti.
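The lookup logic of this summary is very short; the sketch below is our own illustration, with plain sets standing in for the Bloom filters F1, ..., Fd (so it has no false positives, but the control flow is the same).

    def build_summary(sub_tables):
        # F_i represents every item stored in T_i, ..., T_d.
        d = len(sub_tables)
        filters = []
        for i in range(d):
            f = set()
            for t in sub_tables[i:]:
                f.update(t)
            filters.append(f)
        return filters

    def summary_lookup(filters, x):
        # Returns the (0-based) index of the sub-table presumed to hold x,
        # or None if x is not in the MHT.
        if x not in filters[0]:
            return None
        i = 0
        while i + 1 < len(filters) and x in filters[i + 1]:
            i += 1
        return i

    # Items {a, b} stored in T_1 and {c} in T_2:
    filters = build_summary([{"a", "b"}, {"c"}])
    print(summary_lookup(filters, "c"), summary_lookup(filters, "z"))   # 1 None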
The first important observation here is that F1 is simply a standard Bloom filter
for the set of items stored in the MHT. Thus, false positives for F1 merely yield
false positives for the summary. As before, such false positives are acceptable as
long as they are sufficiently infrequent (e.g., the false positive probability of F1 is
0.5%). However, if an item x is in Ti , then it will be inserted into F1 ; : : : ; Fi but not
Fi C1 . If Fi C1 gives a false positive for x, then querying the summary for x yields
an incorrect answer, which is unacceptable to us because x is actually in the MHT.
Thus, F2 ; : : : ; Fd must have extremely small false positive probabilities.
The second important observation is the effect of the distribution of the items
over the sub-tables of the MHT on the quality of the summary. Recall that for a
typical MHT, with high probability, most of the items are stored in T1 , most of the
rest are stored in T2 , etc. In fact, the distribution of items over the sub-tables decays
doubly exponentially. In particular, if Si is the set of items stored in Ti, ..., Td, then
Si is almost surely small for i ≥ 2. Thus, for i ≥ 2, since Fi is just a standard Bloom
filter for Si, we can achieve an extremely small false positive probability for Fi without using
too much space. Since we only need a moderate false positive probability for F1 (as
described above), we can adequately construct it in a reasonable amount of space.
A further advantage of the approach of [36] is that one can obtain fairly precise
estimates of the probabilities that the relevant hash-based data structures fail using
numerical techniques. This property is very useful, as purely theoretical techniques
often obscure constant factors and simulation results can be computationally expen-
sive to obtain (especially when the probabilities being estimated are very small,
as they should be for the sorts of failure probabilities that we are interested in
bounding). Numerical techniques offer the potential for very accurate and predic-
tive results for only a moderate computational cost.
For the summary approach described above, it is fairly easy to see that if we
can obtain numerical information about the distributions of the sizes of the Si ’s,
then the standard Bloom filter analysis can be directly applied to estimate the failure
probability of the summary. The primary issue then becomes obtaining this numer-
ical information (note that the probability of a crisis is just Pr(|S1| < n), and so it
too follows from this information). Now, if m1, ..., md are the sizes of T1, ..., Td,
then it is fairly easy to see that the distribution of the number of items that are stored
in T1 is the same as the number of bins that are occupied if we throw n balls at
random into m1 buckets. Furthermore, for i > 1, the conditional distribution of the
number of items in Ti given that Ni−1 items are stored in T1, ..., Ti−1 is the same
as the number of bins that are occupied if we throw n − Ni−1 balls into mi buckets.
These balls-and-bins probabilities are easy to compute and manipulate to obtain the
desired information about the distribution of the Si's; for details, see [36]. As an ex-
ample of the improvement, for 10,000 items these techniques allow for a hash table
that uses 50% of the space of the hash table in [65] and an accompanying summary
that uses about 66% of the space of the corresponding summary from [65], for a
comparable false positive probability and a failure probability of about 5 × 10^−12;
different tradeoffs are possible (e.g., a larger hash table to get more skew in the
distribution of the items, allowing for a smaller summary).
As is usually the case for hash-based data structures, it is more difficult to mea-
sure the effectiveness of these techniques when deletion operations are allowed.
However, we do note that the summary construction above can be modified to handle
deletions by replacing the Bloom filters with counting Bloom filters. Unfortunately,
while the MHT supports deletions in the natural way (an item can simply be re-
moved from the table), intermixing deletion and insertion operations can have a
major impact on the distribution of the Si ’s, which is critical for the failure prob-
ability guarantees. Thus, while the approach of [36] seems adaptable to deletions,
understanding and mitigating the impact of deletions on these sorts of data structures
remains an important open problem.
In more practical work, Kumar et al. [42, 43] use similar ideas to construct alter-
native high-performance hash tables. The paper [42] presents a variant of a multiple
choice hash table. A hash table is broken into multiple segments, each of which can
be thought of as a separate hash table; each item chooses a bucket in each segment.
If an item cannot be placed without a collision, a standard collision-resolving tech-
nique, such as double hashing, is applied to the segment where the item is placed.
To avoid searching over all segments when finding an item, a Bloom filter is used
for each segment to record the set of items placed in that segment. When an item is
placed, the first priority is to minimize the length of the collision chain, or the search
time, for that item. Often, however, there will be ties in the search length, in which
case priority is given to a segment where the new item will introduce the fewest new
1’s into the Bloom filter; this reduces the chances of false positives. (Some related
theory is presented in [48].)
The paper [43] introduces peacock hashing, which takes advantage of the skewed
construction developed for MHTs. The main innovation of [43] appears to arise from
using more limited hash functions in order to improve rebalancing efforts in case of
deletion. In sub-tables beyond the first, the possible locations of an item depend on
its location in the previous table. Because of this, when an item is deleted, there
are only a very small number of possible locations in the subsequent sub-tables that
need to be checked to see if an item from a later, smaller table can be moved to the
now empty position in the larger, earlier table, potentially reducing the probability
of a crisis. In exchange, however, one gives up some of the power of hashing each
item independently to multiple locations. Also, currently peacock hashing lacks a
complete mathematical analysis, making it hard to compare to other schemes except
by experiment. Some analysis and alternative schemes are suggested in [34].
We now turn our attention to the setting where it is feasible to examine all of an
item’s hash locations in a multiple choice hash table in parallel. In this case, our
goal becomes to increase the space utilization of our hash table constructions, while
ensuring that they are amenable to hardware implementations for high-speed ap-
plications. For instance, one can think of these techniques as potentially enabling
us to take an off-chip hash table and decrease its space overhead enough so that it
can be effectively implemented on-chip, thus eliminating the chip I/O problem from
Section 9.3 and replacing it with a very restrictive memory constraint.
As discussed in Section 9.2.1.3, there are a number of hashing schemes in the
theory literature that allow items to be moved in the table during the insertion of a
new item in order to increase the space utilization of the table. Unfortunately, while
these schemes have excellent amortized performance, they occasionally allow for
insertion operations that take a significant amount of time. For high-speed packet-
processing applications, such delays may be unacceptable.
We discuss two approaches to this problem. First, we consider new hashing
schemes that are designed to exploit the potential of moves while ensuring a rea-
sonable worst-case time bound for a hash table operation. Second, we consider a
more direct adaptation of an existing scheme from the theory literature (specifically,
cuckoo hashing) to this setting, striving for a de-amortized performance guarantee.
The first approach is taken by Kirsch and Mitzenmacher in [35]. That work pro-
poses a number of modifications to the standard MHT insertion scheme that allow
at most one move during an insertion operation. The opportunity for a single move
demonstrates that the space utilization of the standard MHT can be significantly in-
creased without a drastic increase in the worst-case time of a hashing operation. The
proposed schemes are all very similar in spirit; they differ primarily in the tradeoffs
between the amount of time spent examining potential moves during an insertion
and the resulting increases in space utilization.
The core idea behind these schemes is best illustrated by the following procedure,
called the second chance scheme. Essentially, the idea is that as we insert items into
a standard MHT with sub-tables T_1, ..., T_d, the sub-tables fill up from left to right,
with items cascading from T_i to T_{i+1} with increasing frequency as T_i fills up. Thus,
a natural way to increase the space utilization of the table is to slow down this
cascade at every step.
This idea is implemented in the second chance scheme in the following way.
We mimic the insertion of an item x using the standard MHT insertion procedure,
except that if we are attempting to insert x into T_i and the buckets T_i[h_i(x)] and
T_{i+1}[h_{i+1}(x)] are both occupied, rather than simply moving on to T_{i+2} as in the standard
scheme, we check whether the item y in T_i[h_i(x)] can be moved to T_{i+1}[h_{i+1}(y)].
If this move is possible (i.e., the bucket T_{i+1}[h_{i+1}(y)] is unoccupied), then we
perform the move and place x at T_i[h_i(x)]. Thus, we effectively get a second chance
at preventing a cascade from T_{i+1} to T_{i+2}. (We note that this basic idea of relocating
items to shorten the length of hash chains appears to have originated with the work
of Brent on double hashing [6].)
Just as in the standard MHT insertion scheme, there may be items that cannot be
placed in the MHT during the insertion procedure. Previously, we considered this
to be an extremely bad event and strived to bound its probability. Here we take a
different perspective and say that if an item x is not successfully placed in the MHT
during its insertion, then it is placed on an overflow list L, which, in practice, would
be implemented with a CAM. (A similar idea is found in [30], where it is dubbed a
filter cache, and the implementation of the overflow list is different.) To perform a
lookup, we simply check L in parallel with the MHT.
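To make the procedure concrete, the following is a minimal software sketch of second chance insertion (our own illustrative Python, not the hardware design of [35]); the hash functions, table sizes, and the Python set standing in for the CAM-based overflow list L are all placeholder assumptions.

import random

class SecondChanceMHT:
    def __init__(self, sizes, seed=0):
        self.tables = [[None] * m for m in sizes]   # sub-tables T_1 .. T_d
        self.rng = random.Random(seed)
        self.salts = [self.rng.random() for _ in sizes]
        self.overflow = set()                       # stands in for the CAM list L

    def _h(self, i, x):
        # placeholder hash for sub-table i; a real design uses dedicated hardware hashes
        return hash((self.salts[i], x)) % len(self.tables[i])

    def insert(self, x):
        d = len(self.tables)
        for i in range(d):
            b = self._h(i, x)
            if self.tables[i][b] is None:            # standard MHT rule: first empty bucket
                self.tables[i][b] = x
                return
            # both T_i[h_i(x)] and T_{i+1}[h_{i+1}(x)] occupied: try the second chance move
            if i + 1 < d and self.tables[i + 1][self._h(i + 1, x)] is not None:
                y = self.tables[i][b]
                by = self._h(i + 1, y)
                if self.tables[i + 1][by] is None:
                    self.tables[i + 1][by] = y       # move y one sub-table to the right
                    self.tables[i][b] = x            # and let x take its place
                    return
        self.overflow.add(x)                         # unplaced items go to the overflow list

    def lookup(self, x):
        return x in self.overflow or any(
            self.tables[i][self._h(i, x)] == x for i in range(len(self.tables)))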
It turns out that since the second chance scheme only allows moves from left to
right, it is analyzable by a fluid limit or mean-field technique, which is essentially a
way of approximating stochastic phenomena by a deterministic system of differen-
tial equations. The technique also applies to the standard MHT insertion procedure,
as well as a wide variety of extensions to the basic second chance scheme. This
approach makes it possible to perform very accurate numerical analyses of these
systems, and in particular it allows for some interesting optimizations. We refer
to [35] for details, but as an example, we note that when both the standard MHT
insertion scheme and second chance insertion scheme are optimized to use as lit-
tle space as possible with four hash functions so that no more than 0.2% of the
items are expected to overflow from the table under either scheme, then the second
chance scheme requires 72% of the space of the standard scheme, with about 13%
of insertion operations requiring a move.
The second chance scheme is also much more amenable to a hardware implemen-
tation than it may at first seem. To insert an item x, we simply read all of the items
y_1 = T_1[h_1(x)], ..., y_d = T_d[h_d(x)] in parallel. Then we compute the hashes
h_2(y_1), ..., h_d(y_{d-1}) in parallel. (Here, for notational simplicity, we are assuming
that all of T_1[h_1(x)], ..., T_d[h_d(x)] are occupied, so that the y_i's are well defined;
it should be clear how to handle the general case.) At this point, we now have all
of the information needed to execute the insertion procedure without accessing the
hash table (assuming that we maintain a bit vector indicating which buckets of the
table are occupied).
The second chance scheme and its relatives also support deletions in the natural
way: an item can simply be removed from the table. However, as in Section 9.3, the
intermixing of insertions and deletions fundamentally changes the behavior of the
system. In this case, the differential equation approximations become much more
difficult and heuristic, but still useful. For details, see [35]; for further experimental
analysis of schemes that make one move on either an insertion or a deletion, see [34].
We now discuss some possible adaptations of the standard cuckoo hashing scheme,
proposed by Kirsch and Mitzenmacher [37], to obtain better de-amortized per-
formance. Recall that in standard cuckoo hashing, the insertion of an item x
corresponds to a number of sub-operations in the following way. First, we attempt
to insert x into one of its hash locations. If that is unsuccessful, then we choose one
of x’s hash locations at random, evict the item y in that place, and replace it with x.
We then attempt to place y into one of its hash locations, and, failing that, we choose
one of y’s hash locations other than the one from which it was just evicted at ran-
dom, evict the item z in that location, and replace it with y. We then attempt to
place z similarly.
We think of each of these attempts to place an item in its hash locations as a
sub-operation. In a hardware implementation of cuckoo hashing, it is natural to con-
sider an insertion queue, implemented in a CAM, which stores sub-operations to be
processed. To process a sub-operation, we simply remove it from the queue and ex-
ecute it. If the sub-operation gives rise to another sub-operation, we insert the new
sub-operation into the queue. Generally speaking, the queue is implemented with
some policy for determining the order in which sub-operations should be processed.
For the standard cuckoo hashing algorithm, this policy would be for sub-operations
coming from newly inserted items to be inserted at the back of the queue, and a
sub-operation arising from a sub-operation that was just executed to be inserted at
the front of the queue.
The key feature of this approach is that we can efficiently perform insertions,
lookups, and deletions even if we reorder sub-operations. Indeed, an insertion can
be performed by inserting a single sub-operation into the queue, and a lookup can be
performed by examining all of the items’ hash locations in the table and the entire
queue in parallel (since the queue is implemented with a CAM). To perform a dele-
tion, we check whether the item is in the table, and if so we mark the corresponding
bucket as deleted so that the item is overwritten by a future sub-operation. If the
item is in the queue, we remove the corresponding sub-operation from the queue.
Since the queue must actually fit into a CAM of modest size (at least under or-
dinary operating conditions), the main performance issue is the size of the queue
when it is equipped with a particular policy. For instance, the problem with stan-
dard cuckoo hashing policy is that it can become “stuck” attempting to process an
unusually large number of sub-operations arising from a particularly troublesome
insertion operation, allowing new insertion operations to queue up in the mean time.
A natural first step towards fixing this problem is to insert the sub-operations corre-
sponding to newly inserted items on the front of queue, rather than on the back. In
particular, this modification exploits the fact that a newly inserted item has a chance
of being placed in any of its d hash locations, whereas an item that was just evicted
from the table has at most d − 1 unoccupied hash locations (assuming that the item
responsible for the eviction has not been deleted).
Another useful observation comes from introducing the following notion of the
age of a sub-operation. If a sub-operation corresponds to the initial attempt to insert
an item, then that sub-operation has age 0. Otherwise, the sub-operation results from
the processing of another sub-operation with some age a, and we say that the new
sub-operation has age a + 1. The previous queuing policy can then be thought of as
a modification of the standard policy to give priority to sub-operations with age 0.
More generally, we can introduce a policy in which insertion operations are prior-
itized by their ages. Intuitively, this modification makes sense because the older a
sub-operation is, the more likely the original insertion operation that gave rise to it
is somehow troublesome, which in turn makes it more likely that this sub-operation
will not give a successful placement in the table, resulting in a new sub-operation.
While it may not be practical to implement the insertion queue as a priority queue
in this way, since sub-operations with large ages are fairly rare, we should be able to
approximate the performance of the priority queue with the following approach. As be-
fore, sub-operations corresponding to an initial insertion of an item are placed on the
front of the queue. Furthermore, whenever the processing of a sub-operation yields
a new sub-operation, the new sub-operation is placed on the back of the queue.
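A rough software rendering of this queue-based de-amortization (our own illustrative Python, not the design evaluated in [37]; the hash functions, the deque standing in for the CAM-resident queue, and the bound on sub-operations processed per step are placeholder assumptions) is sketched below. It implements the last policy described above: age-0 sub-operations go on the front of the queue, and sub-operations produced by evictions go on the back.

from collections import deque
import random

class QueuedCuckoo:
    def __init__(self, m, d=2, seed=0):
        self.table = [None] * m
        self.d = d
        self.rng = random.Random(seed)
        self.salts = [self.rng.random() for _ in range(d)]
        self.queue = deque()             # stands in for the CAM-resident insertion queue

    def _locs(self, x):
        return [hash((s, x)) % len(self.table) for s in self.salts]

    def insert(self, x):
        # a newly inserted item contributes one age-0 sub-operation at the FRONT of the queue
        self.queue.appendleft((x, None))
        self._work()

    def _work(self, ops_per_step=2):
        # process only a bounded number of sub-operations per arriving insertion
        for _ in range(ops_per_step):
            if not self.queue:
                return
            x, forbidden = self.queue.popleft()
            locs = self._locs(x)
            for b in locs:
                if self.table[b] is None:
                    self.table[b] = x
                    break
            else:
                # evict a resident from a location other than the one x was just evicted from
                choices = [b for b in locs if b != forbidden] or locs
                b = self.rng.choice(choices)
                y, self.table[b] = self.table[b], x
                # the evicted item's sub-operation goes on the BACK of the queue
                self.queue.append((y, b))

    def lookup(self, x):
        return any(self.table[b] == x for b in self._locs(x)) or \
               any(item == x for item, _ in self.queue)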
All of these policies are evaluated and compared empirically in [37]. (It does
not seem possible to conduct a numerical evaluation here, due to the complexity
of mathematically analyzing cuckoo hashing.) Overall, the results indicate that all
of the intuition described above is accurate. In particular, the last queuing policy is
extremely practical and performs substantially better than the standard policy over
long periods of time. More specifically, the size of the queue under the standard
policy is much more susceptible to occasional spikes than it is under the last policy. In practice,
this observation means that when the insertion queue is implemented with a CAM
that should hold the entire queue almost all of the time, the last policy is likely to
perform much better than the original one.
This section describes some additional improvements and applications of Bloom fil-
ters for high-speed packet processing that have been proposed in recent work. We
start by describing some improvements that can be made to the standard Bloom
filter and counting Bloom filter data structures that are particularly well suited to
hardware-based networking applications. Then we describe the approximate con-
current state machine, which is a Bloom filter variant that makes use of these ideas
to efficiently represent state information for a set of items, as opposed to mem-
bership, as in a standard Bloom filter. Finally, we review a number of additional
applications of Bloom filters to a wide variety of high-speed networking problems.
Standard counting Bloom filters, by their very nature, are not particularly space-
efficient. Using the standard optimization for false positives for a Bloom filter from
Section 9.2.1.2, the value for a particular counter in a counting Bloom filter is 0
with probability approximately 1/2. Using multiple bits (e.g., 4, which is the usual
case) to represent counters that take value 0 roughly half the time is an inefficient
use of space. Some space gains can be made in practice by introducing additional
lookups; counters can be kept to two bits and a secondary table can be used for
counters that overflow. More sophisticated approaches exist, but their suitability for
hardware implementation remains untested (see, e.g., [12, 55]).
To address this issue, Bonomi et al. [5] develop new constructions with the same
functionality as a Bloom filter and counting Bloom filter, based on d-left hashing.
These schemes are designed particularly for hardware implementation. In particular,
they generally reduce the number of hashes and memory accesses required by the
standard data structures. The idea behind these constructions actually first appears
in another work by Bonomi et al. [3], where an extension to the Bloomier filter
dubbed the approximate concurrent state machine, or ACSM, is developed. Here we
describe the Bloom filter and counting Bloom filter variants, and discuss ACSMs in
Section 9.5.2.
The starting point for these constructions is the folklore result that one can ob-
tain the same functionality as a Bloom filter for a static set S with near-optimal
performance using a perfect hash function. (A perfect hash function is an easily
computable bijection from S to an array of |S| hash buckets.) One finds a per-
fect hash function P, and then stores at each hash location an f = ⌈log(1/ε)⌉-bit
fingerprint, computed according to some other hash function H. A query on z re-
quires computing P(z) and H(z), and checking whether the fingerprint stored at
P(z) matches H(z). When z ∈ S a correct response is given, and when z ∉ S a false
positive occurs with probability at most ε; this uses n⌈log(1/ε)⌉ bits for a set S of n
items.
The problem with this approach is that it does not cope with changes in the set
S – either insertions or deletions – and perfect hash functions are generally too ex-
pensive to compute in many settings. To deal with this, we make use of the fact,
recognized by Broder and Mitzenmacher [9] in the context of designing hash-based
approaches to IP lookup (along the lines of the work by Waldvogel et al. [74] dis-
cussed in Section 9.2.3), that using d-left hashing provides a natural way to obtain
an “almost perfect” hash function. The resulting hash function is only almost perfect
in that instead of having one set item in each bucket, there can be several (there are
d possible locations for each item), and space is not perfectly utilized.
An example demonstrates the idea behind the approach; details are presented
in [5]. Suppose we wish to handle sets of n items. We utilize a d-left hash table
with d = 3 choices per item, so that on insertion each item chooses one bucket
from each of three sub-tables uniformly at random, and the fingerprint for the item
is then stored in the least loaded of the three choices. Each sub-table will have
n/12 buckets, for n/4 buckets in total, giving an average of 4 items per bucket.
The maximum number of items in a bucket will be 6 with high probability (for
large n, the probability converges to a value greater than 1 − 10^{-30}). Hence we can
implement the hash table as a simple array, with space for 6 fingerprints per bucket;
a lookup then checks, for each sub-table i, whether the corresponding fingerprint
appears in the corresponding bucket, where (b_i, r_i) are the bucket and fingerprint
for the i-th sub-table. Now, two items x and y will share a fingerprint and bucket
if and only if they have the same hash h(x) = h(y), so that a small counter (generally
2 bits) can be used to keep track of collisions for items under the hash function h.
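A simplified software rendering of this d-left fingerprint scheme (our own illustrative Python; [5] gives the full construction, including the small per-fingerprint counters and the hardware-oriented layout, and the hash derivation below is a placeholder) looks roughly as follows.

import hashlib

class DLeftFingerprintFilter:
    """Approximate set membership via d-left hashing of fingerprints."""
    def __init__(self, buckets_per_subtable, d=3, fingerprint_bits=12):
        self.d = d
        self.fbits = fingerprint_bits
        # d sub-tables; each bucket holds a short list of fingerprints (capacity ~6)
        self.subtables = [[[] for _ in range(buckets_per_subtable)] for _ in range(d)]

    def _bucket_and_fp(self, i, x):
        h = hashlib.sha256((str(i) + ":" + str(x)).encode()).digest()
        bucket = int.from_bytes(h[:4], "big") % len(self.subtables[i])
        fp = int.from_bytes(h[4:8], "big") & ((1 << self.fbits) - 1)
        return bucket, fp

    def insert(self, x):
        # compute (b_i, r_i) for every sub-table, then store the fingerprint in the
        # least-loaded choice (ties broken toward the leftmost sub-table)
        choices = [self._bucket_and_fp(i, x) for i in range(self.d)]
        i = min(range(self.d), key=lambda j: len(self.subtables[j][choices[j][0]]))
        b, fp = choices[i]
        self.subtables[i][b].append(fp)

    def lookup(self, x):
        # a (possibly false-positive) hit: some sub-table's chosen bucket holds the fingerprint
        for i in range(self.d):
            b, fp = self._bucket_and_fp(i, x)
            if fp in self.subtables[i][b]:
                return True
        return False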
Bloom filters have also recently been employed for more sophisticated packet pro-
cessing and packet classification tasks in hardware. A useful example design is
given in [16, 18], where the question being tackled is how to find specific sub-
strings, commonly called signatures, in packets at wire speeds. A common current
use of signatures is to scan for byte sequences particular to Internet worms, allowing
malicious packets to be dropped. However, other natural uses arise in a variety of
settings.
If we think of a collection of signatures as being a set of strings, then a natural ap-
proach is to represent this set with a Bloom filter. More specifically, it makes sense to
separate signatures by length, and use a Bloom filter for each length, allowing the
Bloom filters to be considered in parallel as the bytes of the packet are shifted
through as a data stream. In order to obtain a suitable hardware implementation,
however, there are further details to consider. For example, to handle the possi-
ble deletion and insertion of signature strings, one can use an associated counting
Bloom filter. As insertions and deletions are likely to be rare, these counting Bloom
filters can be kept separately in slower memory [16]. To avoid costly hashing over-
head for longer signatures, such strings can be broken into smaller strings, and a
small amount of state is kept to track how much of the string has been seen. In such
a setting, the Bloom filter can also be used to track the state (in a manner similar to
one of the approaches suggested for approximate concurrent state machines [3]).
Using Bloom filters allows a large database of signature strings to be effectively
represented with a small number of bits, making use of fast memory in hardware
feasible. Thousands and even tens of thousands of strings can be effectively dealt
with while maintaining wire speed [16].
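As a sketch of the per-length Bloom filter idea (our own illustrative Python, not the hardware design of [16]; plain Python sets stand in for the Bloom filters and the exact verification table, and the byte-at-a-time loop replaces the shifting hardware datapath), signatures are grouped by length and every window of the payload is tested against the filter for each length.

def build_filters(signatures):
    # one "filter" per signature length; a real design uses a Bloom filter per length,
    # with an associated counting Bloom filter in slow memory to support updates
    filters = {}
    for s in signatures:
        filters.setdefault(len(s), set()).add(s)
    return filters

def scan(payload, filters, exact_set):
    """Return offsets at which a signature starts, after exact verification."""
    matches = []
    for i in range(len(payload)):
        for length, f in filters.items():
            window = payload[i:i + length]
            if len(window) < length:
                continue
            # Bloom-filter membership test: may yield false positives,
            # so a hit is verified against an exact (off-chip) table
            if window in f and window in exact_set:
                matches.append((i, window))
    return matches

sigs = {b"GetInfo\x0d", b"annotate"}
print(scan(b"xxGetInfo\x0dyy", build_filters(sigs), sigs))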
Bloom filters have also been proposed for use in various longest prefix matching
implementations [17, 19]. Many variations of IP lookup algorithms, for example,
create hash tables consisting of prefixes of various lengths that have to potentially
be matched against a given input IP address, with the goal of finding the longest pos-
sible match. The number of accesses to the hash tables can potentially be reduced by
using a Bloom filter to record the set of prefixes in each hash table [17]. By check-
ing the Bloom filter, one can avoid an unnecessary lookup into a hash table when
the corresponding prefix does not exist in the table. Notice, though, that because of
false positives, one cannot simply take the longest match suggested by the Bloom
filters themselves; the hash table lookup must be done to check for a true match. The
average number of hash table lookups, however, is reduced dramatically, so under
the assumption that hash table lookups are dramatically slower or more costly than
Bloom filter lookups, there are substantial gains in performance.
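A minimal sketch of this Bloom-filter-assisted lookup (our own illustrative Python, not the design of [17]; Python sets stand in for the on-chip Bloom filters, so false positives are not modeled, and dictionaries stand in for the off-chip hash tables) is shown below.

def lpm_with_filters(addr_bits, tables, filters):
    """tables[l]: off-chip hash table of length-l prefixes -> next hop.
       filters[l]: on-chip Bloom filter over the same prefix set (here a plain set)."""
    # probe candidate lengths longest-first; the Bloom filter screens out lengths
    # that cannot match, and the hash table confirms true matches
    for l in sorted(tables, reverse=True):
        prefix = addr_bits[:l]
        if prefix in filters[l]:             # cheap on-chip check
            hop = tables[l].get(prefix)      # expensive off-chip lookup only on a hit
            if hop is not None:              # guards against a false positive
                return hop
    return None

tables = {8: {"10101010": "A"}, 16: {"1010101011110000": "B"}}
filters = {l: set(t) for l, t in tables.items()}
print(lpm_with_filters("1010101011110000" + "0" * 16, tables, filters))   # "B"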
Bloom filters can also be used in multi-dimensional longest prefix matching ap-
proaches for packet classification, potentially giving a cheaper solution than the
standard TCAM approach [19]. The solution builds on top of what is known as
the cross-product algorithm: find the longest prefix match on each field, and hash
the resulting vector of longest matches into a hash table that will provide the
packet classification rules associated with that vector. Unfortunately, straightfor-
ward implementations of the cross-product rule generally lead to very large hash
tables, because the cross-product approach leads to a large number of prefix vectors,
roughly corresponding to the product of the number of prefixes in each field. This
creates a significant overhead. An alternative is to split the rules into subsets and
perform the cross-product algorithm on each subset, in order to reduce the overall
size for hash tables. To keep the cost of doing longest prefix matchings reasonable,
for each field, the longest prefix match is performed just once, over all subsets.
The problem with this approach is now that for each subset of rules, for each field,
there are multiple prefix lengths that might be possible; conceivably, any subprefix
of the longest prefix match over all subsets of rules could apply to any specific sub-
set of rules. To avoid hash table lookups for all possible combinations of prefixes
over all subsets of rules, a Bloom filter of valid prefix vectors for each subset of
rules can be maintained, reducing the number of necessary lookups in the hash table
in a spirit similar to [17].
Results based on this approach yield a solution that requires at most 4 + r memory
accesses on average when r rules can match a packet; this can be further reduced
with pipelining. Memory costs range on the order of 32 to 45 bytes per rule, allowing
reasonably large rule sets to be effectively handled in SRAM.
Another recently proposed approach makes use of multiple choice hashing in
order to reduce memory usage for IP lookup and packet classification algorithms, as
well as other related algorithms. The setting here revolves around the fact that many
algorithms for these types of problems reduce to directed graph traversal problems.
Longest prefix matching structures are often represented by a trie, where one finds
the longest prefix match by walking down the trie. Similarly, regular expression
matching structures are often represented by finite automata, where one finds the
match by a walk on the corresponding automata graph.
Kumar et al. [44] describe an approach for compressing the representation of tries
and other directed graphs, by avoiding using pointers to name nodes. Instead, the
history of the path used to reach a node is used as a key for that node, and a multiple-
choice hash structure stores the graph information associated with the node, such as
its neighbors, in a format that allows continued traversal. For this approach to be
successful, there should be no collisions in the hash table. This is best accomplished
with low space overheads using cuckoo hashing, which works well even in the face
of updates to the underlying graph that change the set of keys being stored. By
avoiding the use of expensive node identifiers, a factor of 2 or more in space can be
saved over standard representations with minimal additional processing.
Another recent methodology for longest prefix matching problems based on
hash-based approaches was described as part of the Chisel architecture [33]. Here
the underlying technology used is a Bloomier filter. Prefixes correspond to keys,
which are stored in the filter. One issue with this approach is that Bloomier filters
do not support updates; insertions, in particular, can require the Bloomier filter to
be reconstructed, which can take time linear in the number of keys. (See related
lower bounds in [14].) The trace analysis in [33] suggests that updates can be done
quickly without reconstruction in most cases, but it is not clear whether this holds
more generally, and there is not currently a theoretical justification for this finding.
The authors also develop other techniques to handle issues particular to the longest
prefix matching problem in this context, such as how to cope with wildcard bits in
the keys.
Today’s routers provide per-interface counters that can be read by the management
protocol SNMP, and/or a much more expensive solution that involves sampling
packets using the NetFlow protocol [23]. The problem with this state of affairs is
that SNMP counters are extremely coarse, while NetFlow is extremely expensive.
Indeed, for the SNMP approach, there is only a single counter for all packets re-
ceived and sent on an interface. In particular, there is no way to find how much a
particular source is sending over the interface. Sampling packets with NetFlow ad-
dresses such issues, but often with prohibitive cost. For instance, even if we only
sample one in a thousand packets, the wire speed may be such that gigabits of data
are collected every minute. Much of this data is lost, and the rest (which is still
substantial) must be transferred to a measurement station, where it is logged to disk
and eventually post-processed. Furthermore, despite its cost, NetFlow does not pro-
vide accurate answers to natural questions, such as the number of distinct source
addresses seen in packets [25]. In this section, we describe hash-based measure-
ment algorithms that can be implemented in hardware and can answer questions
that SNMP counters cannot in a way that is much more direct and less expensive
than NetFlow. The setting for all of these algorithms is a single interface in a router,
and the algorithm is implemented by some logic or software associated with the link.
Estan and Varghese [24] also describe a sampling-based technique called sample-
and-hold for heavy-hitters estimation in which each packet is sampled indepen-
dently with some very low probability and all subsequent occurrences of a sampled
packet’s flow are counted exactly in a CAM. A problem with this approach is that
the CAM may be polluted with small flows. To address this issue, Lu et al. [47]
propose elephant traps that, in essence, enhance sample-and-hold by periodically
removing small flows from the CAM using a Least Recently Used-like algorithm
that can be efficiently implemented in hardware. Since this chapter concentrates on
hash-based algorithms, we do not describe this approach any further.
A general approach to measurement using hash-based counters is described by
Kumar et al. [41]. The idea is that, over some time interval, all flows are hashed in
parallel to several counters as in the count-min sketch. At the end of the interval,
the counters are sent to software to be estimated for the measures of interest. For
example, Kumar et al. [41] describe how to estimate the flow size distribution (i.e.,
the number of packets sent by each flow). For simplicity, assume that each counter is
incremented by one for each packet and not by the number of bytes. The main idea is
to use expectation maximization to iteratively estimate the flow size distribution. We
first estimate the current vector of flow sizes coarsely using the hardware counters.
Then we calculate expectations from the estimates, replace the original estimates
with these expectations, and iterate the process until convergence.
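The per-packet counter update here is essentially a count-min-style update. The sketch below (our own illustrative Python with placeholder hash functions, not the hardware of [41]) shows the update path and the simple point estimate a plain count-min sketch would return; [41] instead feeds the raw counters to the expectation-maximization procedure in software.

import random

class CounterArray:
    def __init__(self, rows=4, cols=1024, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.random() for _ in range(rows)]
        self.counters = [[0] * cols for _ in range(rows)]

    def update(self, flow_id, count=1):
        # per packet: hash the flow ID in parallel to one counter per row and increment
        for salt, row in zip(self.salts, self.counters):
            row[hash((salt, flow_id)) % len(row)] += count

    def point_estimate(self, flow_id):
        # the standard count-min readout (an overestimate); [41] instead recovers the
        # whole flow size distribution from the counters via expectation maximization
        return min(row[hash((salt, flow_id)) % len(row)]
                   for salt, row in zip(self.salts, self.counters))

sketch = CounterArray()
for _ in range(7):
    sketch.update("flow-a")
print(sketch.point_estimate("flow-a"))   # at least 7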
The idea of using more complex iterative estimators in software at the end of
the time interval is also applied in later work on counter braids by Lu et al. [45].
The hardware setup generalizes the earlier work on count-min sketches in that it
allows different levels of counters (two seems to suffice), with each level having a
smaller number of counters than the previous levels. A flow is hashed in parallel to
some number of counters in a table at the first level. If these counters overflow, the
flow is then hashed to some number of counters in a table at the next level (again
the mapping is done using a hash function), and so on. At the end of the interval an
iterative estimate of the counts for each flow is provided, similar to the technique
of [41]. However, the authors use tools inspired by coding theory (in particular,
turbo-code decoding) rather than expectation maximization.
One problem with both techniques in [41, 45] is that since estimates are found
only at the end of certain time intervals, the techniques lose information about flows.
By contrast, a standard count-min sketch uses a simple estimator that can be com-
puted in real time. Indeed, at the instant that a flow’s estimator crosses a threshold,
the flow ID can be logged. The techniques of [41, 45], however, lose all information
about flow IDs. This is acceptable in [41] because the measure (flow size distribu-
tion) does not require any flow IDs (while heavy-hitters clearly do). This problem
is addressed in [45] by assuming that either the flows to be measured are already
known (in which case a simple CAM suffices to determine whether a packet should
be processed) or that all flow IDs are logged in slow memory. In the latter case, the
real gain of [45] is to log all flows in slow and large DRAM, but to update flow
size information in a much smaller randomized structure in SRAM. Despite this
problem, the general technique of using sophisticated iterative estimators, whether
directed by expectation maximization or decoding techniques, seems like a promis-
ing direction for future work.
The number of flows on a link is a useful indicator for a number of security applica-
tions. For example, the Snort intrusion detection tool detects port scans by counting
all the distinct destinations sent to by a given source, and sounding an alarm if this
amount is over a threshold. Similarly, to detect a denial of service attack, one might
want to count the number of sources sending to a destination because many such
attacks use multiple forged addresses. In both examples, it suffices to count flows,
where a flow identifier is a destination (for detecting port scans) or a source (for
detecting denial of service attacks).
A naive method to count, say, source–destination pairs would be to keep a counter
together with a hash table that stores all of the distinct 64 bit source–destination ad-
dress pairs seen thus far. When a packet arrives with source–destination address
pair (s, d), the algorithm searches the hash table for (s, d); if there is no match,
the counter is incremented and (s, d) is added to the hash table. Unfortunately,
this solution requires memory proportional to the total number of observed source–
destination pairs, which is prohibitively expensive.
An algorithm due to Flajolet and Martin based on probabilistic counting [29] can
considerably reduce the memory needed by the naive solution at the cost of some
accuracy in counting flows. The intuition behind the approach is to compute a metric
of how rare a certain pattern is in a random hash of a flow ID, and define the rarity
r(f) of a flow ID f to be the rarity of its corresponding hash. We then keep track
of the largest value X of r(f) ever seen over all flow IDs f that pass across the
link. If the algorithm sees a very large value for X , then by our definition of rarity,
it stands to reason that there is a large number of flows across the link.
More precisely, for each packet seen the algorithm computes a hash function
on the flow ID. It then counts the number of consecutive zeroes starting from the
least significant position of the hash result: this is the measure r(·) of rarity used.
The tracked value X now corresponds to the largest number of consecutive zeroes
seen (starting from the least significant position) in the hashed flow ID values of all
packets seen so far.
At the end of some time interval, the algorithm converts X into an estimate 2^X
for the number of flows. Intuitively, if the stream contains two distinct flows, on
average one flow will have the least significant bit of its hashed value equal to zero;
if the stream contains eight flows, on average one flow will have the last three bits
of its hashed value equal to zero – and so on. Thus, 2^X is the natural estimate of the
number of flows corresponding to the tracked value X.
Hashing is essential for two reasons. First, implementing the algorithm directly
on the sequence of flow IDs itself could make the algorithm susceptible to flow ID
assignments where the traffic stream contains a flow ID f with many trailing zeroes.
If f is in the traffic stream, even if the stream has only a few flows, the algorithm
without hashing will wrongly report a large number of flows. Notice that adding
multiple copies of the same flow ID to the stream will not change the algorithm’s
final result, because all copies hash to the same value.
A second reason for hashing is that accuracy can be boosted using multiple
independent hash functions. The basic idea with one hash function can guarantee
at most 50% accuracy. By using n independent hash functions in parallel to com-
pute n separate estimates X_1, ..., X_n, we can greatly reduce the error by calculating
the mean M of the X_i's and returning 2^M as the estimated number of flows. (Note
that M should be represented as a floating point number, not an integer.)
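A compact sketch of this estimator (our own illustrative Python; real implementations use hardware-friendly hash functions, and the parameters here are our own choices) is given below.

import hashlib

def trailing_zeros(v):
    # number of consecutive zero bits starting from the least significant position
    if v == 0:
        return 32
    t = 0
    while v & 1 == 0:
        v >>= 1
        t += 1
    return t

def estimate_flows(flow_ids, n_hashes=16):
    maxima = [0] * n_hashes
    for f in flow_ids:
        for i in range(n_hashes):
            h = int.from_bytes(hashlib.sha256(f"{i}:{f}".encode()).digest()[:4], "big")
            maxima[i] = max(maxima[i], trailing_zeros(h))
    M = sum(maxima) / n_hashes          # the mean is kept as a floating point value
    return 2 ** M

# duplicates do not change the estimate, since copies hash to the same value
print(estimate_flows(["s1-d1", "s2-d1", "s1-d1", "s3-d1", "s4-d1"]))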
More modular algorithms for flow counting (again hash-based) for networking
purposes are described by Estan et al. in [25]. Suppose we wish to count up to
64,000 flows. Then it suffices to hash each flow into a single bit in a 64,000 bit
map (initially zeroed), and then estimate the number of flows accounting for hash
collisions by counting the number of ones.
However, just as Hubble estimated the number of stars in the universe by sam-
pling the number of stars in a small region of space, one could reduce memory to,
say, 32 bits but still hash flows into a range of 64,000 values. In this case, flows that hash to
values of 32 or beyond would not set bits, but flows that hash to values from 0 to 31 would. At the end
of some time interval the number of bits set to 1 is counted, an estimate is found by
correcting for collisions, and then the estimate is scaled up by 64,000/32 = 2,000
to account for the flows that were lost. Unfortunately, the accuracy of this estimator
depends critically on the assumption that the number of flows is within some con-
stant factor of 64,000. If the number of flows is much smaller (e.g., 50), the error is
considerable.
The accuracy can be greatly improved in two ways. In a parallel approach, an
array of, say, 32 counters is used, each responsible for estimating the number of
flows in a different range (e.g., 1–10, 10–100, 100–1,000) on a logarithmic scale.
Thus, at least one counter will be accurate. The counter used for the final estimate is
the one that has neither too many nor too few bits set. The precise algorithm
is in [25].
In some cases, even this memory can be too much. An algorithm proposed by
Singh et al. [63] addresses this issue by using only one counter and adapting se-
quentially to reduce memory (at the cost of accuracy). The algorithm has a bit map
of, say, 32 bits. The algorithm initially hashes all flows to values between 0 and 31. If all the
bits fill up, the number of flows is clearly greater than 32, so the algorithm clears
the bitmap and now hashes future flows (all memory of older flows is lost) to values between
0 and 63. It keeps doubling the range of the hash function until the bitmap
stops overflowing. The range of the hash function is tracked in a small scale-factor
register. The net result is that flow counting can be done for up to a million flows
using a bit map of size 32 and a 16-bit scale factor register at a cost of a factor of 2
loss in accuracy (better tradeoffs are described in [63]).
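A rough software rendering of this adaptive scheme (our own illustrative Python; [63] describes the hardware version and better estimators, and the hash function, the linear-counting collision correction, and the parameters below are our own choices) follows.

import hashlib
import math

class AdaptiveBitmap:
    def __init__(self, bits=32):
        self.bits = bits
        self.bitmap = 0
        self.scale = 1                 # scale-factor register: hash range = bits * scale

    def _hash(self, flow_id):
        h = hashlib.sha256(str(flow_id).encode()).digest()
        return int.from_bytes(h[:4], "big") % (self.bits * self.scale)

    def add(self, flow_id):
        v = self._hash(flow_id)
        if v < self.bits:              # only the sampled fraction 1/scale sets bits
            self.bitmap |= 1 << v
        if self.bitmap == (1 << self.bits) - 1:
            # bitmap full: clear it and double the hash range (older flows are forgotten)
            self.scale *= 2
            self.bitmap = 0

    def estimate(self):
        ones = bin(self.bitmap).count("1")
        ones = min(ones, self.bits - 1)                            # avoid log(0)
        corrected = -self.bits * math.log(1 - ones / self.bits)    # collision correction
        return corrected * self.scale                              # scale up for unsampled flows

bm = AdaptiveBitmap()
for i in range(500):
    bm.add("flow-%d" % i)
print(round(bm.estimate()))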
Singh et al. [63] also describe how both the heavy-hitters and flow size estimators
can be used together to extract worm signatures. Any string of some pre-specified
fixed size in a packet payload is considered to be a possible worm signature. If
the string occurs frequently (measured using a heavy-hitters estimator) and has a
large number of associated unique source and destination IP addresses (measured
by the sequential flow size estimator in [63]) the string is considered as a possible
worm signature and is subjected to further offline tests. The sequential hash-based
9.7 Conclusion
work in this general area. Many new and potentially interesting ideas have been
consistently produced in recent years on the theoretical side; finding the right ap-
plications and proving the value of these ideas could be extremely beneficial. Also,
those building applications need to be able to understand and predict the behavior of
candidate schemes, and to evaluate tradeoffs among the schemes they might im-
plement, in advance and based on sound theoretical principles. On the other side, those
working at a more theoretical level need to pay attention to the needs and require-
ments of applications, including possibly some details of hardware implementation.
A second high-level direction, more speculative and far-reaching, is to con-
sider whether a hashing infrastructure could be developed to support hash-based
approaches for high-speed packet processing. Hash-based approaches offer great
value, including relative simplicity, flexibility, and cost-effectiveness. While not ev-
ery packet-processing task can naturally be placed in a hashing framework, as this
survey shows, a great many can. One could imagine having some standardized, flex-
ible, programmable hashing architecture for Internet devices, designed not for a
specific task or algorithm, but capable of being utilized for many hash-based data
structures or algorithms. The goal of such an infrastructure would not only be to
handle issues that have already arisen in today’s network, but also to provide a gen-
eral framework for handling additional, currently unknown problems that may arise
in the future. Additionally, a key value of a standardized hashing infrastructure
lies not only in its use for monitoring or measuring individual routers, links, or
other components, but also in monitoring and measuring the network as a whole.
Acknowledgements Adam Kirsch and Michael Mitzenmacher received support for this work
from NSF grant CNS-0721491 and a research grant from Cisco Systems, Inc. George Varghese
received support from NSF grant 0509546 and a grant from Cisco Systems, Inc.
References
1. Y. Azar, A. Broder, A. Karlin, and E. Upfal. Balanced Allocations. SIAM Journal on Comput-
ing, 29(1):180–200, 1999.
2. B. Bloom. Space/Time Tradeoffs in Hash Coding with Allowable Errors. Communications of
the ACM, 13(7):422–426, 1970.
3. F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese. Beyond Bloom Filters:
From Approximate Membership Checks to Approximate State Machines. In Proceedings of
ACM SIGCOMM, pp. 315–326, 2006.
4. F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese. Bloom Filters via
d -left Hashing and Dynamic Bit Reassignment. In Proceedings of the Allerton Conference on
Communication, Control and Computing, 2006.
5. F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese. An Improved Con-
struction for Counting Bloom Filters. In Proceedings of the 14th Annual European Symposium
on Algorithms (ESA), pp. 684–695, 2006.
6. R. Brent. Reducing the Retrieval Time of Scatter Storage Techniques. Communications of the
ACM, 16(2), pp. 105–109, 1973.
7. A. Broder and A. Karlin. Multilevel Adaptive Hashing. In Proceedings of the 1st ACM-SIAM
Symposium on Discrete Algorithms (SODA), pp. 43–53, 1990.
32. G. Gonnet. Expected Length of the Longest Probe Sequence in Hash Code Searching. Journal
of the Association for Computing Machinery, 28(2):289–304, 1981.
33. J. Hasan, S. Cadambi, V. Jakkula, and S. Chakradhar. Chisel: A Storage-efficient, Collision-
free Hash-based Network Processing Architecture. In Proceedings of the 33rd International
Symposium on Computer Architecture (ISCA), pp. 203–215, 2006
34. A. Kirsch and M. Mitzenmacher. On the Performance of Multiple Choice Hash Tables with
Moves on Deletes and Inserts. In Proceedings of the Forty-Sixth Annual Allerton Conference,
2008.
35. A. Kirsch and M. Mitzenmacher. The Power of One Move: Hashing Schemes for Hardware.
In Proceedings of the 27th IEEE International Conference on Computer Communications
(INFOCOM), 2008.
36. A. Kirsch and M. Mitzenmacher. Simple Summaries for Hashing with Choices. IEEE/ACM
Transactions on Networking, 16(1):218–231, 2008.
37. A. Kirsch and M. Mitzenmacher. Using a Queue to De-amortize Cuckoo Hashing in Hardware.
In Proceedings of the Forty-Fifth Annual Allerton Conference on Communication, Control, and
Computing, 2007.
38. A. Kirsch, M. Mitzenmacher, and U. Wieder. More Robust Hashing: Cuckoo Hashing with a
Stash. To appear in Proceedings of the 16th Annual European Symposium on Algorithms, 2008.
39. D. Knuth. Sorting and Searching, vol. 3 of The Art of Computer Programming (2nd edition),
Addison-Wesley Publishing Company, 1998.
40. R. R. Kompella, S. Singh, and G. Varghese. On Scalable Attack Detection in the Network. In
Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, pp. 187–200,
2004.
41. A. Kumar, M. Sung, J. Xu, and J. Wang. Data Streaming Algorithms for Efficient and Accurate
Estimation of Flow Size Distribution. In Proceedings of the Joint International Conference on
Measurement and Modeling of Computer Systems (SIGMETRICS/Performance), pp. 177–188,
2004.
42. S. Kumar and P. Crowley. Segmented Hash: An Efficient Hash Table Implementation for High
Performance Networking Subsystems. In Proceedings of the 2005 ACM Symposium on Archi-
tecture for Networking and Communications Systems (ANCS), pp. 91–103, 2005.
43. S. Kumar, J. Turner, and P. Crowley. Peacock Hash: Fast and Updatable Hashing for High
Performance Packet Processing Algorithms. In Proceedings of the 27th IEEE International
Conference on Computer Communications (INFOCOM), 2008.
44. S. Kumar, J. Turner, P. Crowley, and M. Mitzenmacher. HEXA: Compact Data Structures for
Faster Packet Processing. In Proceedings of the Fifteenth IEEE International Conference on
Network Protocols (ICNP), pp. 246–255, 2007.
45. Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar, and A. Kabbani. Counter Braids:
A Novel Counter Architecture for Per-Flow Measurement. Proceedings of the 2008 ACM
SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
(SIGMETRICS), 2008.
46. Y. Lu, B. Prabhakar, and F. Bonomi. Perfect Hashing for Networking Algorithms. In Proceed-
ings of the 2006 IEEE International Symposium on Information Theory (ISIT), pp. 2774–2778,
2006.
47. Y. Lu, M. Wang, B. Prabhakar, and F. Bonomi. ElephantTrap: A Low Cost Device for Identi-
fying Large Flows. In Proceedings of the 15th Annual IEEE Symposium on High-Performance
Interconnects (HOTI), pp. 99–108, 2007.
48. S. Lumetta and M. Mitzenmacher. Using the Power of Two Choices to Improve Bloom Filters.
To appear in Internet Mathematics.
49. U. Manber and S. Wu. An Algorithm for Approximate Membership checking with Application
to Password Security. Information Processing Letters, 50(4), pp. 191–197, 1994.
50. M. Mitzenmacher. Compressed Bloom Filters. IEEE/ACM Transactions on Networking,
10(5):613–620, 2002.
51. M. Mitzenmacher, A. Richa, and R. Sitaraman. The Power of Two Choices: A Survey of
Techniques and Results, edited by P. Pardalos, S. Rajasekaran, J. Reif, and J. Rolim. Kluwer
Academic Publishers, Norwell, MA, 2001, pp. 255–312.
52. M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and
Probabilistic Analysis. Cambridge University Press, 2005.
53. M. Mitzenmacher and S. Vadhan. Why Simple Hash Functions Work: Exploiting the Entropy
in a Data Stream. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete
Algorithms (SODA), pp. 746–755, 2008.
54. A. Östlin and R. Pagh. Uniform Hashing in Constant Time and Linear Space. In Proceedings
of the Thirty-Fifth Annual ACM Symposium on Theory of Computing (STOC), pp. 622–628,
2003.
55. A. Pagh, R. Pagh, and S. S. Rao. An Optimal Bloom Filter Replacement. In Proceedings of the
Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 823-829, 2005.
56. R. Pagh and F. Rodler. Cuckoo Hashing. Journal of Algorithms, 51(2):122–144, 2004.
57. R. Panigrahy. Efficient Hashing with Lookups in Two Memory Accesses. In Proceedings of
the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 830–839,
2005.
58. F. Putze, P. Sanders, and J. Singler. Cache-, Hash-, and Space-Efficient Bloom Filters. In Pro-
ceedings of the Workshop on Experimental Algorithms, pp. 108–121, 2007. Available as
Springer Lecture Notes in Computer Science, volume 4525.
59. M. V. Ramakrishna. Hashing Practice: Analysis of Hashing and Universal Hashing. In Pro-
ceedings of the 1988 ACM SIGMOD International Conference on Management of Data, pp.
191–199, 1988.
60. M. V. Ramakrishna. Practical Performance of Bloom Filters and Parallel Free-Text Searching.
Communications of the ACM, 32(10):1237–1239, 1989.
61. M.V. Ramakrishna, E. Fu, and E. Bahcekapili. Efficient Hardware Hashing Functions for High
Performance Computers. IEEE Transactions on Computers, 46(12):1378–1381, 1997.
62. A. Siegel. On Universal Classes of Extremely Random Constant-Time Hash Functions. Siam
Journal on Computing, 33(3):505–543, 2004.
63. S. Singh, C. Estan, G. Varghese, and S. Savage. Automated Worm Fingerprinting. In Proceed-
ings of the 6th ACM/USENIX Symposium on Operating System Design and Implementation
(OSDI), 2004.
64. A. Snoeren, C. Partridge, L. Sanchez, C. Jones, F. Tchakountio, B. Schwartz, S. Kent, and
W. Strayer. Single-Packet IP Traceback. IEEE/ACM Transactions on Networking, 10(6):721–
734, 2002.
65. H. Song, S. Dharmapurikar, J. Turner, and J. Lockwood. Fast Hash Table Lookup Using Ex-
tended Bloom Filter: An Aid to Network Processing. In Proceedings of ACM SIGCOMM, pp.
181–192, 2005.
66. D. Thaler and C. Hopps. Multipath Issues in Unicast and Multicast Next-Hop Selection. RFC
2991, 2000. Available at ftp://ftp.rfc-editor.org/in-notes/rfc2991.txt.
67. M. Thorup. Even Strongly Universal Hashing is Pretty Fast. In Proceedings of the Eleventh
Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 496–497, 2000.
68. R. J. Souza, P. G. Krishnakumar, C. M. Özveren, R. J. Simcoe, B. A. Spinney, R. E. Thomas,
and R. J. Walsh. GIGAswitch System: A High-Performance Packet-Switching Platform.
Digital Technical Journal, 6(1):9–22, 1994.
69. V. Srinivasan and G. Varghese. Fast Address Lookups Using Controlled Prefix Expansion.
ACM Transactions on Computer Systems, 17(1):1–40, 1999.
70. G. Varghese. Network Algorithmics: An Interdisciplinary Approach to Designing Fast Net-
worked Devices. Morgan Kaufmann Publishers, 2004.
71. J. S. Vitter. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software,
11(1):37–57, 1985.
72. S. Venkataraman, D. Song, P. B. Gibbons, and A. Blum. New Streaming Algorithms for Fast
Detection of Superspreaders. In Proceedings of the 12th ISOC Symposium on Network and
Distributed Systems Security (SNDSS), 149–166, 2005.
73. B. Vöcking. How Asymmetry Helps Load Balancing. Journal of the ACM, 50(4):568–589,
2003.
74. M. Waldvogel, G. Varghese, J. Turner, and B. Plattner. Scalable High Speed IP Routing
Lookups. ACM SIGCOMM Computer Communication Review, 27(4):25–36, 1997.
75. M. N. Wegman and J. L. Carter. New Hash Functions and Their Use in Authentication and Set
Equality. Journal of Computer and System Sciences, 22(3):265–279, 1981.
76. P. Woelfel. Asymmetric Balanced Allocation with Simple Hash Functions. In Proceedings of
the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm (SODA), pp. 424–433,
2006.
Chapter 10
Fast Packet Pattern-Matching Algorithms
F. Yu, Y. Diao, R.H. Katz, and T.V. Lakshman
Abstract Packet content scanning at high speed has become extremely important
due to its applications in network security, network monitoring, HTTP load bal-
ancing, etc. In content scanning, the packet payload is compared to a set of patterns
specified as regular expressions. In this chapter, we first describe the typical patterns
used in packet-scanning applications and show that for some of these patterns the
memory requirements can be prohibitively high when traditional matching methods
are used. We then review techniques for efficient regular expression matching and
explore regular expression rewrite techniques that can significantly reduce mem-
ory usage. Based on new rewrite insights, we propose guidelines for pattern writers
to make matching fast and practical. Furthermore, we discuss deterministic finite
automaton (DFA) link compression techniques and review algorithms and data
structures that are specifically designed for matching regular expressions in net-
working applications.
10.1 Motivation
F. Yu ()
Microsoft Research Silicon Valley, 1065 La Avenida, Mountain View, CA 94043
e-mail: [email protected]
Y. Diao
Department of Computer Science, University of Massachusetts Amherst,
140 Governors Drive, Amherst, MA 01003
e-mail: [email protected]
R.H. Katz
Electrical Engineering and Computer Science Department, University of California Berkeley,
Berkeley, CA 94720
e-mail: [email protected]
T.V. Lakshman
Bell-Labs, Alcatel-Lucent, 600 Mountain Avenue, Murray Hill NJ 07974
e-mail: [email protected]
newly emerging edge network services. Examples of the emerging edge network
services include high-speed firewalls, which protect end hosts from security attacks;
HTTP load balancing, which smartly redirects packets to different servers based on
their HTTP requests and Extensible Markup Language (XML) processing, which
facilitates the sharing of data across different systems.
In packet-scanning systems, the payload of packets in a traffic stream is matched
against a given set of patterns to identify specific classes of applications, viruses,
protocol definitions, and so on. When viruses and worms were simple, patterns
could be expressed as simple strings, e.g., "GetInfo\x0d" is the signature
for a back door attack [3]. However, as viruses and worms became more complex,
they rendered simple pattern-matching approaches inadequate for sophisticated pay-
load scanning. For example, polymorphic worms make it impossible to enumerate
all the possible signatures using explicit strings. Regular expressions appear to be a
suitable choice for these patterns due to their rich expressive power. Consider a reg-
ular expression, "^Entry/file/[0-9.]{71,}//.*\x0Aannotate\x0A", for detecting a
Concurrent Versions System (CVS) revision overflow attack [12]. This pattern first
searches for a fixed string pattern "Entry/file/" followed by 71 or more digits or
dots, then a fixed pattern "//" followed by some arbitrary characters (".*"), and finally
the pattern "\x0Aannotate\x0A". Obviously, it is very hard to enumerate this type
of attack using fixed string patterns.
As a result, regular expressions are replacing explicit string patterns as the
pattern-matching language of choice in packet-scanning applications. In the Linux
Application Protocol Classifier (L7-filter) [12], all protocol identifiers are expressed
as regular expressions. Similarly, the SNORT [3] intrusion detection system has
evolved from no regular expressions in its rule set in April 2003 (Version 2.0) to
1131 out of 4867 rules using regular expressions as of February 2006 (Version 2.4).
Another intrusion detection system, Bro [1], also uses regular expressions as its
pattern language.
In this chapter, we present a number of regular expression pattern-matching
schemes. We begin with a brief introduction to regular expressions with simple
examples in Section 10.2. We briefly survey traditional regular expression match-
ing methods in Section 10.3. We then analyze the special characteristics of typical
regular expression patterns used in network scanning applications in Section 10.4.
We show that some of the complex patterns lead to exponential memory usage or
low matching speed when using the traditional methods. Based on this observation,
we propose rewrite rules for two common types of complex regular expressions
in Section 10.5. The rewrite rules can dramatically reduce the sizes of resulting
deterministic finite automata (DFAs), making them small enough to fit in high-
speed memory. In Section 10.6, we review DFA compression techniques that can
further reduce memory consumption. Finally, in Section 10.7, we discuss some
advanced DFA processing techniques developed specifically for high-speed router
implementation.
Finite automata are a natural formalism for regular expression matching. There
are two main categories: Deterministic Finite Automaton (DFA) and Nondetermin-
istic Finite Automaton (NFA). This section provides a brief survey of existing
methods using these two types of automata.
10.3.1 DFA
A DFA consists of a finite set of input symbols, denoted as Σ, a finite set of states,
and a transition function δ [8]. In networking applications, Σ contains the 2^8 sym-
bols from the extended ASCII code. Among the states, there is a single start state
and a set of accepting states. The transition function δ takes a state and an input
symbol as arguments and returns a state. A key feature of a DFA is that at any time
there is at most one active state in the DFA.
Figure 10.1 shows a simple DFA for the regular expression ((A|B)C|(A|D)E),
which matches the strings AC, BC, AE, and DE. If the given string is BC, it will first go
to State 1 based on character B, then it will arrive at the final accept state (State 2)
based on character C. Given another input starting with A, the DFA will first go
to State 5. Depending on whether the next input is C or E, it will transition to the
corresponding accept state, that is, State 2 or State 4. If the next input is neither C
nor E, the DFA will report the result of no-match for the given string.
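For concreteness, the DFA of Figure 10.1 can be written down as a small transition table, as in this illustrative Python sketch (our own rendering; state numbers follow the figure, with state 0 standing for the start state and states 2 and 4 accepting).

# DFA for ((A|B)C|(A|D)E): start state 0, accepting states {2, 4}
delta = {
    (0, "A"): 5, (0, "B"): 1, (0, "D"): 3,
    (1, "C"): 2,
    (3, "E"): 4,
    (5, "C"): 2, (5, "E"): 4,
}
accepting = {2, 4}

def dfa_match(s):
    state = 0
    for ch in s:
        state = delta.get((state, ch))   # at most one active state at any time
        if state is None:
            return False                 # no transition: report no-match
    return state in accepting

print(dfa_match("BC"), dfa_match("AE"), dfa_match("AB"))   # True True False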
10.3.2 NFA
An NFA is similar to a DFA except that the δ function maps from a state and a
symbol to a set of new states. Therefore, multiple states can be active simultaneously
in an NFA.
Figure 10.2 shows the NFA for the previous example ((A|B)C|(A|D)E). Unlike
a DFA, given an input starting with A, two NFA states will be active at the
same time (State 1 and State 3).
(Figure 10.1: DFA for ((A|B)C|(A|D)E). Figure 10.2: NFA for ((A|B)C|(A|D)E).)
State 1 means we have already seen the prefix
pattern (A|B), now waiting for the character C. State 3 means we have seen (A|D),
now waiting for the character E. Depending on the next input character, the NFA
will go to State 2 if given C, go to State 4 if given E, or fail to match the regular
expression if given any other character.
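The corresponding NFA can be simulated by tracking a set of active states, as in this sketch (again our own illustrative Python; state numbers follow Figure 10.2).

# NFA for ((A|B)C|(A|D)E): from the start state, A or B lead to state 1,
# and A or D lead to state 3, so an initial A activates both 1 and 3.
nfa_delta = {
    ("start", "A"): {1, 3}, ("start", "B"): {1}, ("start", "D"): {3},
    (1, "C"): {2},
    (3, "E"): {4},
}
accepting = {2, 4}

def nfa_match(s):
    active = {"start"}
    for ch in s:
        # every active state must be examined for each input character
        active = set().union(*(nfa_delta.get((q, ch), set()) for q in active))
        if not active:
            return False
    return bool(active & accepting)

print(nfa_match("AC"), nfa_match("DE"), nfa_match("DC"))   # True True False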
Using automata to recognize regular expressions introduces two types of com-
plexity: automata storage and processing costs. A theoretical worst-case study [8]
shows that a single regular expression of length n can be expressed as an NFA with
O(n) states. When the NFA is converted into a DFA, it may generate O(Σ^n) states,
where Σ is the set of symbols. The processing complexity for each character in the
input is O(1) in a DFA, but is O(n^2) for an NFA when all n states are active at the
same time.
To handle m regular expressions, two choices are possible: processing them in-
dividually in m automata, or compiling them into a single automaton. The former
is used in Snort [3] and Linux L7-filter [12]. The latter is proposed in recent stud-
ies [5, 6] so that the single composite NFA can support shared matching of common
prefixes of those expressions. Despite the demonstrated performance gains over us-
ing m separate NFAs, in practice this approach still experiences large numbers of
active states. This has the same worst-case complexity as the sum of m separate
NFAs. Therefore, this approach on a serial processor can be slow, as given any input
character, each active state must be serially examined to obtain new states.
In DFA-based systems, compiling m regular expressions into a composite DFA
provides guaranteed performance benefit over running m individual DFAs. Specifi-
cally, a composite DFA reduces processing cost from O(m) (a cost of O(1) for each
automaton) to O(1), i.e., a single lookup to obtain the next state for any given char-
acter. However, the number of states in the composite automaton grows to O(Σ^{mn})
in the theoretical worst case as listed in Table 10.2.
There is a middle ground between DFA and NFA, called lazy DFA. Lazy DFAs
are designed to reduce the memory consumption of conventional DFAs [7, 14]: a
lazy DFA keeps a subset of the DFA states that match the most common strings in
memory; for uncommon strings, it extends the subset from the corresponding NFA
at runtime. As such, a lazy DFA is usually much smaller than the corresponding fully
compiled DFA and provides good performance for common input strings. The Bro
intrusion detection system [1] adopts this approach. However, malicious senders can
easily construct packets with uncommon strings to keep the system busy and slow
down the matching process. As a result, the system will start dropping packets and
malicious packets can sneak through.
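The lazy-DFA idea can be sketched as follows (a toy illustration of ours, not Bro's implementation): DFA states, i.e., sets of NFA states, are materialized on demand from the NFA of Figure 10.2 and cached, so only the states actually reached by the input are ever built.

# A hedged sketch of a lazy DFA: DFA states (sets of NFA states) are
# expanded from the NFA only when an input actually reaches them, and cached.
class LazyDFA:
    def __init__(self, nfa, start, accepting):
        self.nfa, self.accepting = nfa, accepting
        self.start = frozenset(start)
        self.cache = {}                      # (dfa_state, char) -> dfa_state

    def step(self, state, ch):
        key = (state, ch)
        if key not in self.cache:            # expand from the NFA on demand
            self.cache[key] = frozenset(
                t for s in state for t in self.nfa.get((s, ch), ()))
        return self.cache[key]

    def matches(self, text):
        state = self.start
        for ch in text:
            state = self.step(state, ch)
            if state & self.accepting:
                return True
        return False

# The NFA of Figure 10.2, ((A|B)C|(A|D)E), encoded as in the earlier sketch.
nfa = {(0, 'A'): {1, 3}, (0, 'B'): {1}, (0, 'D'): {3},
       (1, 'C'): {2}, (3, 'E'): {4}}
lazy = LazyDFA(nfa, {0}, {2, 4})
print(lazy.matches('AE'), lazy.matches('BE'), len(lazy.cache))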
The previous section surveyed the most representative traditional approaches to reg-
ular expression matching. In this section, we show that these techniques do not
always work efficiently for some of the complex patterns that arise in networking
applications. To this end, we first enumerate the pattern structures that are common
in networking applications in Section 10.4.1. We then show that certain pattern-
structures that occur in networking applications are hard to match using traditional
methods – they incur either excessive memory usage or high computation cost. In
particular, we show two categories of regular expressions in Section 10.4.2 that lead
to quadratic and exponential numbers of states, respectively.
We study the complexity of DFAs for typical patterns used in real-world packet
payload scanning applications such as Linux L7-filter (as of Feb 2006), SNORT
(Version 2.4), and Bro (Version 0.8V88). Table 10.3 summarizes the results.1
Explicit strings generate DFAs of size linear in the number of characters in the
pattern. Twenty-five percent of the networking patterns, in the three applications
we studied (Linux L7-filter, SNORT, and Bro), fall into this category and they
generate relatively small DFAs with an average of 24 states.
If a pattern starts with '^', it creates a DFA of polynomial complexity with respect
to the pattern length k and the length restriction j on the repetition of a class of
characters in the pattern. Our observation from the existing payload scanning
rule sets is that the pattern length k is usually limited. The length restriction j
is usually small too, unless it is for buffer overflow attempts. In that case, j will
be more than 300 on average and sometimes even reaches thousands. Therefore,
Case 4 in Table 10.3 can result in a large DFA because it has a factor quadratic
in j. Although this type of pattern only constitutes 5.1% of the total patterns,
they create DFAs with an average of 136,903 states.
1 This study is based on the use of exhaustive matching and one-pass search defined in [16].
There is also a small percentage (6.8%) of patterns starting with “.*” and having
length restrictions (Case 5). These patterns create DFAs of exponential sizes. We
will address Cases 4 and 5 in detail in Section 10.4.2.
We compare the regular expressions used in three networking applications,
namely, SNORT, Bro, and the Linux L7-filter, against those used in emerging
Extensible Markup Language (XML) filtering applications [5, 6] where regular ex-
pressions are matched over text documents encoded in XML. The results are shown
in Table 10.4. We observe three main differences:
(1) While both types of applications use wildcards ('.', '?', '+', '*'), the pat-
terns for packet-scanning applications contain larger numbers of them. Many
such patterns use multiple wildcard metacharacters (e.g., '.', '*'). For example,
the pattern for identifying the Internet radio protocol, “membername.*session.*
player”, has two wildcard fragments “.*”. Some even contain over 10 such
wildcard fragments. As regular expressions are converted into state machines
for pattern matching, large numbers of wildcards bring multiple matching
choices to the matching process, causing the corresponding DFAs to grow
exponentially.
(2) Classes of characters (“[]”) are used in packet-scanning applications, but not
in XML processing applications. In addition, the class of characters may
intersect with other classes or wildcards. For example, the pattern for detect-
ing buffer overflow attacks to the Network News Transport Protocol (NNTP)
is “^SEARCH\s+[^\n]{1024}”, where a class of characters “[^\n]” interacts
with its preceding white space characters “\s+”. When given an input with
SEARCH followed by a series of white spaces, there is ambiguity whether
these white spaces match \s+ or the non-return class “[^\n]”. As we will show
later in Section 10.4.2.1, such interaction can result in a highly complex state
machine.
(3) A high percentage of patterns in packet payload scanning applications have
length restrictions on some of the classes or wildcards, while such length re-
strictions usually do not occur in XML filtering. For example, the pattern for
detecting the Internet Message Access Protocol (IMAP) email server buffer over-
flow attack is “.*AUTH\s[^\n]{100}”. This pattern contains the
restriction that there should be 100 non-return characters “[^\n]” after a match
of the keyword AUTH and any number of white spaces “\s”. As we shall show in
Section 10.4.2.2, such length restrictions can increase the resource needs for
regular expression matching.
A common misconception is that patterns starting with '^' create simple DFAs. In
fact, even in the presence of '^', classes of characters that overlap with the prefix
pattern can still yield a complex DFA. Consider the pattern “^B+[^\n]{3}D”, where
the class of characters [^\n] denotes any character but the return character '\n'.
Figure 10.3 shows that the corresponding DFA has a quadratic number of states.
Fig. 10.3 A DFA for pattern ^B+[^\n]{3}D that generates a quadratic number of states
The quadratic complexity comes from the fact that the letter B overlaps with the
class of characters [^\n] and, hence, there is inherent ambiguity in the pattern: the
second B letter can be matched either as part of B+, or as part of [^\n]{3}. There-
fore, if an input contains multiple Bs, the DFA needs to remember the number of Bs
it has seen and their locations in order to make a correct decision with the next input char-
acter. If the class of characters has a length restriction of j bytes, the DFA needs O(j^2)
states to remember the combination of the distance to the first B and the distance to
the last B.
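To make the quadratic growth concrete, the following experiment (entirely our own construction, over a reduced four-symbol alphabet, and not taken from [16]) hand-builds the NFA for ^B+[^\n]{j}D and counts the DFA states produced by subset construction for a few values of j.

# Counting DFA states for ^B+[^\n]{j}D via subset construction; the NFA is
# hand-built and the alphabet is reduced to four representative symbols.
def build_nfa(j):
    nfa = {}                                   # (state, char) -> set of next states
    def add(s, c, t):
        nfa.setdefault((s, c), set()).add(t)
    alphabet = ['B', 'D', 'x', '\n']           # 'x' stands for any other character
    add(0, 'B', 1)                             # state 0: start, state 1: inside B+
    add(1, 'B', 1)                             # more B's keep us inside B+
    for c in alphabet:
        if c != '\n':
            add(1, c, 2)                       # first of the j counted characters
    for i in range(2, j + 1):                  # states 2..j+1 count [^\n]{j}
        for c in alphabet:
            if c != '\n':
                add(i, c, i + 1)
    add(j + 1, 'D', j + 2)                     # the final D reaches the accept state
    return nfa, alphabet

def dfa_state_count(j):
    nfa, alphabet = build_nfa(j)
    start = frozenset({0})
    seen, stack = {start}, [start]
    while stack:                               # explore all reachable subsets
        s = stack.pop()
        for c in alphabet:
            t = frozenset(x for q in s for x in nfa.get((q, c), ()))
            if t and t not in seen:
                seen.add(t)
                stack.append(t)
    return len(seen)

for j in (4, 8, 16, 32):
    print(j, dfa_state_count(j))               # grows roughly quadratically in j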
Seventeen patterns in the SNORT rule set fall into this quadratic state cate-
gory. For example, the regular expression for the NNTP rule is “^SEARCH\s+[^\n]
{1024}”. Similar to the example in Figure 10.3, \s overlaps with [^\n]. White space
characters cause ambiguity of whether they should match \s+ or be counted as part
of the 1024 non-return characters [^\n]{1024}. For example, an input of SEARCH
followed by 1024 white spaces and then 1024 'A's will have 1024 ways of match-
ing strings, i.e., one white space matches \s+ and the rest are part of [^\n]{1024},
or two white spaces match \s+ and the rest are part of [^\n]{1024}, and so on. By
using 1024^2 states to remember all possible sequences of these white spaces, the
DFA accommodates all the ways to match the substrings of different lengths. Note
that all these substrings start with SEARCH and hence are overlapping matches.
This type of quadratic state problem cannot be solved by an NFA-based ap-
proach. Specifically, the corresponding NFA contains 1031 states; among these, the
first six are for the matching of SEARCH, the next one for the matching of \s+,
and the remaining 1024 states for the counting of [^\n]{1024}, with one state for each
count. An intruder can easily construct an input of SEARCH followed by 1024 white
spaces. With this input, both the \s+ state and all the 1023 non-return states would
be active at the same time. Given the next character, the NFA needs to check these
1024 states sequentially to compute a new set of active states, hence significantly
slowing down the pattern-matching speed.
In real life, many payload scanning patterns contain an exact distance requirement.
Figure 10.4 shows the DFA for an example pattern “.*A..CD”. An exponential
number of states (2^{2+1}) are needed to represent these two wildcard characters. This
is because we need to remember all possible effects of the preceding As as they
may yield different results when combined with subsequent inputs. For example, an
input AAB is different from ABA because a subsequent input BCD forms a valid
pattern with AAB (AABBCD), but not so with ABA (ABABCD). In general, if a pat-
tern matches exactly j arbitrary characters, O(2^j) states are needed to handle the
requirement that the distance exactly equals j. This result is also reported in [6].
Similar results apply to the case where the class of characters overlaps with the
prefix, e.g., “.*A[A-Z]{j}D”.
Similar structures exist in real-world pattern sets. In the intrusion detection sys-
tem SNORT, 53.8% of the patterns (mostly for detecting buffer overflow attempts)
contain a fixed length restriction. Around 80% of the rules start with '^'; hence,
they will not cause exponential growth of the DFA. The remaining 20% of the patterns
do suffer from the state explosion problem. For example, consider the rule for de-
tecting IMAP authentication overflow attempts, which uses the regular expression
“.*AUTH\s[^\n]{100}”. This rule detects any input that contains AUTH, then a
white space, and no return character in the following 100 bytes. If we directly com-
pile this pattern into a DFA, the DFA will contain more than 10,000 states, because
of the state explosion caused by the interaction of the “.*” prefix with the length
restriction [^\n]{100}.
Fig. 10.4 A DFA for pattern .*A..CD that generates an exponential number of states
Having identified the typical patterns that yield large DFAs, in this section we in-
vestigate possible rewriting of some of those patterns to reduce the DFA size. Such
rewriting is enabled by relaxing the requirement of exhaustive matching to that of
non-overlapping matching (Section 10.5.1). With this relaxation, we propose pat-
tern rewriting techniques that explore the potential of trading off exhaustive pattern
matching for memory efficiency for quadratic patterns (Section 10.5.2) and expo-
nential patterns (Section 10.5.3). Finally, we offer guidelines to pattern writers on
how to write patterns amenable to efficient implementation (Section 10.5.4).
In language recognition, an input string is considered to match a
regular expression if the string is matched from start to end by a DFA corresponding
to that regular expression. In contrast, in packet payload scanning, a regular expres-
sion pattern can be matched by the entire input or by specific substrings of the input².
Without a priori knowledge of the starting and ending positions of those substrings
(unless the pattern starts with '^' that restricts it to be matched at the beginning of
the line, or ends with '$' that limits it to be matched at the end of the line), the
DFAs created for recognizing all substring matches can be highly complex. This is
because the DFA needs to remember all the possible subprefixes it has encountered.
When there are many patterns with a lot of wildcards, they can be simultaneously
active (recognizing part of the pattern). Hence, a DFA needs many states to record
all possible combinations of partially matched patterns.
For a better understanding of the matching model, we next present a few concepts
pertaining to the completeness of matching results and the DFA execution model for
substring matching. Given a regular expression pattern and an input string, a com-
plete set of results contains all substrings of the input that the pattern can possibly
match. For example, given a pattern “ab*” and an input abbb, four possible matches
can be reported: a, ab, abb, and abbb. We call this style of matching Exhaustive
Matching. It is formally defined as below:
Exhaustive Matching Consider the matching process M as a function from a pat-
tern P and a string S to a power set of S, such that M(P, S) = {substring S′ of S |
S′ is accepted by the DFA of P}.
In practice, it is expensive and often unnecessary to report all matching sub-
strings, as most applications can be satisfied by a subset of those matches. For
example, if we are searching for the Oracle user name buffer overflow attempt, the
pattern may be “^USR\s[^\n]{100,}”, which searches for packets starting with
“USR\s” and followed by 100 or more non-return characters. An incoming packet
with “USR\s” followed by 200 non-return characters may have 100 ways of match-
ing the pattern because each combination of the “USR\s” with the sequential 100
to 200 characters is a valid match of the pattern. In practice, reporting just one of
the matching results is sufficient to detect the buffer overflow attack. Therefore, we
propose a new concept, Non-overlapping Matching, that relaxes the requirements of
exhaustive matching.
Non-overlapping Matching Consider the matching process M as a function from
a pattern P and a string S to a set of strings, specifically,
M(P, S) = {substring S_i of S | ∀ S_i, S_j accepted by the DFA of P, S_i ∩ S_j = ∅}.
If a pattern appears in multiple locations of the input, this matching process re-
ports all non-overlapping substrings that match the pattern. We revisit our example
above. For the pattern ab* and the input abbb, the four matches overlap by sharing
the common prefix a.
2 The techniques presented in this chapter assume packets are reassembled into a stream before
checking for patterns. For pattern matching on out-of-order packets, please refer to [9].
If there is a '\n' character within the next 100 bytes, the return character must
also be within 100 bytes of the second AUTH\s.
If there is no '\n' character within the next 100 bytes, the first AUTH\s and the
following characters have already matched the pattern.
The intuition is that we can rewrite the pattern such that it only attempts to cap-
ture one match of the prefix pattern. Following the intuition, we can simplify the
DFA by removing the states that deal with the successive AUTH\s. As shown in
Figure 10.6, the simplified DFA first searches for AUTH in the first four states, then
looks for a white space, and after that starts to count and check whether the next 100
bytes contain a return character. After rewriting, the DFA only contains 106 states.
The rewritten pattern can be derived from the simplified DFA shown in
Figure 10.6. We can transform this DFA to an equivalent NFA in Figure 10.7
using standard automaton transform techniques [8]. The transformed NFA can be
directly described using the following regular expression:
([^A]|A[^U]|AU[^T]|AUT[^H]|AUTH[^\s]|AUTH\s[^\n]{0,99}\n)*AUTH\s[^\n]{100}
This pattern first enumerates all the cases that do not satisfy the pattern and
then attaches the original pattern to the end of the new pattern. In other words,
Fig. 10.6 The simplified DFA for the IMAP authentication pattern after rewriting
Fig. 10.7 The NFA transformed from the simplified DFA of Figure 10.6
“.*” is replaced with the cases that do not match the pattern, represented by
([^A]|A[^U]|AU[^T]|AUT[^H]|AUTH[^\s]|AUTH\s[^\n]{0,99}\n)*
Then, when the DFA comes to the states for AUTH\s[^\n]{100}, it must be able
to match the pattern. Since the rewritten pattern is directly obtained from a DFA of
size j + 5, it generates a DFA with a linear number of states rather than the exponential
number needed before applying the rewrite.
More generally, it is proven in [16] that the pattern “.*AB[A-Z]{j}” can be rewrit-
ten as “([^A]|A[^B]|AB[A-Z]{j-1}[^A-Z])*AB[A-Z]{j}” for detecting
non-overlapping strings. Similar rewrite rules apply to patterns with other forms of
length restriction, e.g., “.*AB[A-Z]{j,}”.
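Under the caveat that Python's re engine is a backtracking matcher rather than a DFA, a quick sanity check that the rewritten pattern and the original agree on match/no-match for non-overlapping detection might look like this; the test inputs are our own assumptions.

import re

# Compare the original and rewritten IMAP patterns on a few crafted inputs.
# This only checks match / no-match equivalence, not automaton size or speed.
j = 100
original  = re.compile(r'.*AUTH\s[^\n]{%d}' % j)
rewritten = re.compile(
    r'(?:[^A]|A[^U]|AU[^T]|AUT[^H]|AUTH[^\s]|AUTH\s[^\n]{0,%d}\n)*'
    r'AUTH\s[^\n]{%d}' % (j - 1, j))

tests = [
    'AUTH ' + 'x' * j,                        # overflow attempt: both match
    'xxAUTH ' + 'y' * (j - 1) + '\nz',        # return char too early: neither matches
    'AUTH AUTH ' + 'x' * j,                   # repeated AUTH prefixes: both match
]
for t in tests:
    print(bool(original.search(t)), bool(rewritten.search(t)))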
In [16], these two rewriting rules are applied to the Linux L7-filter, SNORT, and
Bro pattern sets. While the Linux L7-filter pattern set does not contain any pattern
that needs to be rewritten, the SNORT pattern set contains 71 rules that need to be
rewritten and the Bro pattern set contains 49 such patterns (mostly imported from
SNORT). For both types of rewrite, the DFA size reduction rate is over 98%.
From the analysis, we can see that patterns with length restrictions can sometimes
generate large DFAs. In typical packet payload scanning pattern sets including
Linux L7-filter, SNORT, and Bro, 21.4–56.3% of the length restrictions are asso-
ciated with classes of characters. The most common of these are “[^\n]”, “[^\]]”
(not ']'), and “[^\"]” (not '"'), used for detecting buffer overflow attempts. The length
restrictions of these patterns are typically large (233 on the average and reaching up
to 1024). For these types of patterns, we highly encourage the pattern writer to add
“^” so as to avoid the exponential state growth. For patterns that cannot start with
“^”, the pattern writers can use the techniques shown in Section 10.5.3 to generate
patterns whose number of states is linear in the length restriction.
Even for patterns starting with '^', we need to avoid interactions between a
character class and its preceding character, as shown in Section 10.5.2. One may
wonder why a pattern writer uses \s+ in the pattern “^SEARCH\s+[^\n]{1024}”,
when it could be simplified to \s. Our understanding is that, in reality, a server imple-
mentation of a search task usually interprets the input in one of the two ways: either
skips a white space after SEARCH and takes the following up to 1024 characters
to conduct a search, or skips all white spaces and takes the rest for the search. The
original pattern writer may want to catch intrusion into systems of either implemen-
tation. However, the way the original pattern is written, it generates false positives
if the server uses the second type of implementation (skipping all the white spaces).
This is because if an input is SEARCH followed by 1024 white spaces and then some non-
whitespace regular command of less than 1024 bytes, the server can skip these white
spaces and take the follow-up command successfully. However, this legitimate input
will be caught by the original pattern as an intrusion because these white spaces
themselves can trigger the alarm. To catch attacks to this type of server implemen-
tation, while not generating false positives, we need the following pattern:
“^SEARCH\s+[^\s][^\n]{1023}”
In this pattern, \s+ matches all white spaces and [^\s] means the first non-white
space character. If there are more than 1023 non-return characters following the first
non-white space character, it is a buffer overflow attack. By adding [^\s], the am-
biguity in the original pattern is removed; given an input, there is only one way to
match each packet. As a result, this new pattern generates a DFA of linear size. To
generalize, we recommend that pattern writers avoid all possible overlaps between
neighboring segments in the pattern. Here, overlap means that an input can match
both segments simultaneously, e.g., \s+ and [^\n]. Overlaps will generate a large
number of states in a DFA because the DFA needs to enumerate all the possible
ways to match the pattern.
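The false-positive behavior and its fix can be checked directly with a backtracking regex engine; this is only a match/no-match comparison (Python's re is not a DFA), and the test inputs are our own.

import re

# The ambiguous pattern flags a legitimate "skip all spaces" input, while the
# unambiguous rewrite only flags the genuine overflow attempt.
ambiguous   = re.compile(r'^SEARCH\s+[^\n]{1024}')
unambiguous = re.compile(r'^SEARCH\s+[^\s][^\n]{1023}')

legit  = 'SEARCH' + ' ' * 1024 + 'ok'    # server skips the spaces, short command
attack = 'SEARCH ' + 'A' * 1024          # genuine overflow attempt

print(bool(ambiguous.search(legit)),  bool(unambiguous.search(legit)))   # True  False
print(bool(ambiguous.search(attack)), bool(unambiguous.search(attack)))  # True  True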
The pattern rewriting schemes presented in the previous section reduce the DFA
storage overhead by reducing the number of DFA states. Besides the states, the DFA
storage overhead is also affected by the links between states. This section discusses
link compression techniques.
If no link compression is applied, each state in the DFA has 2^8 = 256 possible
outgoing links, one for each ASCII alphabet input. Usually not all outgoing links
are distinct. Therefore, table compression techniques can be used to efficiently rep-
resent the identical outgoing links [4]. However, these techniques are reported to
be inefficient when applied to networking patterns because on the average one state
has more than 50 distinct next states [11].
Kumar et al. proposed Delayed Input DFA (D2FA), a new representation of reg-
ular expressions for reducing the DFA storage overhead. Instead of compressing
identical links originating from one state, it compresses links across states, based
on the observation that multiple states in a DFA can have identical outgoing links.
Therefore, linking these states together through default transitions removes the
need to store the outgoing links of each state separately. For a concrete example, con-
sider three states s1, s2, and s3 and their outgoing links in Table 10.5.
Table 10.5 Outgoing links of the three example states s1, s2, and s3
Fig. 10.8 (a) Weights of shared links between states; (b) and (c) two example default link selections
States s1 and
s2 have identical next states on inputs A, C , and D. Only character B leads to differ-
ent next states. Similarly, s2 and s3 have identical next states except for character C .
Instead of these three states storing next states separately, D2FA only stores one out-
going link for s2 (for character B), and one for s3 (for character C). For the other characters,
s2 can have a default link to s1 and s3 a default link to s2, where the identical links
are stored. In this way, the storage overhead of D2FA can be significantly smaller
than the original DFA.
To construct a D2FA from a DFA, one can check the number of identical outgoing
links between any two states and use that as a weight function. The weight indicates
the number of links that can be eliminated in the D2FA. Figure 10.8(a) shows the weights
in the previous example. The goal of default link selection is to pick default links
between states that share the highest weights. Note that a default path must not
contain cycles, because otherwise it may bring the D2FA into an infinite loop on some
given input. Therefore, the default paths form a tree or forest. Figure 10.8(b)
and (c) are two example selections, and (b) has a higher weight than (c). In [11],
maximum weight spanning tree algorithms are used to create the default paths and
consequently convert the DFA to a D2FA.
The storage savings of D2FA come at the cost of multiple memory lookups. In
the previous example, if using the DFA, given an input A at current state s2, we
can obtain the next state with one table lookup. With the D2FA of Figure 10.8(b), two
memory lookups are necessary. First, we perform one lookup to find out that s2 has
no stored outgoing link for character A and we obtain the default link to s1. Next, we
perform another lookup into s1 to retrieve the next state for A, which is s2. Default
links can be connected to form a default path. With a default path, multiple memory
lookups are needed. For example, given an input A at state s3 in Figure 10.8(b), we
need two extra memory lookups, one following the default link from s3 to s2 and the
other from s2 to s1. To generalize, given an input, the number of memory lookups
is the number of default links followed plus one. In the worst case, the longest
default path becomes the system bottleneck. Therefore, when constructing a D2FA
from a DFA, it is critical to bound the maximum default path length. A spanning
tree-based heuristic algorithm is used in [11] for this purpose.
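The default-link lookup can be sketched as follows; the data layout, state names, and concrete transition targets below are our own (Table 10.5's actual targets are not reproduced), and only the sharing structure described above is preserved.

# A minimal D2FA lookup sketch: each state stores only the transitions that
# differ from its default state, plus a default link; target states are placeholders.
class State:
    def __init__(self, name, default=None):
        self.name, self.trans, self.default = name, {}, default

def next_state(state, ch):
    lookups, s = 0, state
    while s is not None:
        lookups += 1                      # one memory lookup per state visited
        if ch in s.trans:
            return s.trans[ch], lookups
        s = s.default                     # follow the default link and retry
    return None, lookups

s1 = State('s1')
s2 = State('s2', default=s1)
s3 = State('s3', default=s2)
s1.trans = {'A': s2, 'B': s3, 'C': s1, 'D': s1}   # the root stores its full row
s2.trans = {'B': s1}                              # only the link that differs from s1
s3.trans = {'C': s2}                              # only the link that differs from s2

print(next_state(s2, 'A')[1])   # 2 lookups: miss at s2, resolved at s1
print(next_state(s3, 'A')[1])   # 3 lookups: s3 -> s2 -> s1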
The previous sections presented rewrite and compression techniques to reduce the
storage overhead of DFAs for fast packet pattern processing. DFA sizes can be
further reduced by using data structures and algorithms particularly suitable for
these packet patterns. In particular, Kumar et al. [10] identified several limitations
of traditional DFA-based approaches to packet pattern processing and proposed
techniques to overcome these limitations, resulting in compact representations of
multiple patterns for high-speed packet content scanning.
A foremost limitation of traditional DFA-based approaches is that they employ
complete patterns to parse the packet content. These approaches fail to exploit the
fact that normal packet streams rarely match more than the first few symbols of any
pattern. As a result, the automata unnecessarily explode in size as they attempt to
represent the patterns in their entirety even if the tail portions of the patterns are
rarely visited. To overcome this limitation, a key idea in [10] is to isolate frequently
visited portions of the patterns, called pattern prefixes, from the infrequent portions,
called pattern suffixes. Another important observation of common packet patterns
is that the prefixes are generally simpler than the suffixes. Hence, such prefixes can
be implemented using a compact DFA representation and stored in a fast memory,
expediting the critical path of packet scanning. On the other hand, the suffixes can
be implemented using DFAs if they fit in memory, or even using NFAs since they
are expected to be executed only infrequently. Such a prefix and suffix-based archi-
tecture is referred to as a bifurcated pattern-matching architecture.
There is an important tradeoff in such a bifurcated pattern-matching architecture:
On the one hand, we want to make the prefixes small so that the automaton that is
active all the time is compact and fast. On the other hand, very small prefixes can be
matched frequently by normal data streams, causing frequent invocations of the slow
processing of the complex suffixes. Hence, a good solution must strike an effective
balance between the two competing goals. The solution proposed in [10] is sketched
below with some simplification:
1. Construct an NFA for each packet pattern and execute all those NFAs against
typical network traffic. For each NFA, compute the probability with which each
state of the NFA becomes active and the probabilities with which the NFA makes
its various transitions. The NFAs for two example patterns and their transition
probabilities are illustrated in Figure 10.9.
2. Once these probabilities are computed, determine a cut in the NFA graph such
that (i) there are as few nodes as possible on the left-hand side of the cut and
(ii) the probability that the states on the right-hand side of the cut are active is
sufficiently small. Such a cut is illustrated in Figure 10.9(b).
Fig. 10.9 Example NFAs for the patterns .*AB.*CD and .*CD[^D]*EF*G, annotated with per-transition matching probabilities, and the cut between their prefixes and suffixes
3. After the cut is determined, a composite DFA is constructed for all the prefixes
of the NFAs on the left-hand side of the cut. A DFA or NFA is chosen for each
suffix on the right-hand side depending on the available memory.
Experimental results in [10] show that more than 50% reduction of memory usage
can be achieved for a spectrum of pattern sets used in network intrusion detection
systems, when the DFAs for entire patterns are replaced with the DFAs for the pre-
fixes of those patterns obtained using the above technique.
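The cut selection in step 2 above can be sketched roughly as follows; the greedy threshold rule and the probability values are our own illustrative assumptions (loosely based on the first pattern of Fig. 10.9), not the algorithm of [10].

# A greedy stand-in for the cut selection in step 2: keep a state on the prefix
# side of the cut if its probability of being active exceeds a threshold.
def choose_cut(activation_prob, threshold=0.05):
    prefix = {s for s, p in activation_prob.items() if p >= threshold}
    suffix = set(activation_prob) - prefix
    return prefix, suffix

# Illustrative probabilities loosely based on the first pattern of Fig. 10.9(a).
probs = {1: 0.25, 2: 0.2, 3: 0.01, 4: 0.005}
print(choose_cut(probs))        # states 1-2 become the prefix, 3-4 the suffix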
A second limitation of traditional DFA-based approaches is that given a set of
patterns to be matched simultaneously with the input data, a composite DFA main-
tains a single state of execution, which represents the combination of all the partial
matches of those patterns. As a result, it needs to employ a large number of states
to remember various combinations of the partial matches. In particular, an expo-
nential blowup of the DFA can occur when multiple patterns consist of a simple
sequence of characters followed by a Kleene closure [8] over a class of characters,
e.g., the two prefixes on the left-hand side of the cut in Figure 10.9(b). In this sce-
nario, the DFA needs to record the power set of the matching results of such prefixes
using individual states, hence the exponential size of the machine [10, 16].
To mitigate the combinatorial effect of partial matches of multiple patterns, a
history-augmented DFA [10], H-FA, equips the composite DFA with a small auxil-
iary memory that uses a set of history flags to register the events of partial matches.
H-FA further identifies the fragments of the composite DFA that perform similar
processing of the input data and only differ in the True/False values of a history flag.
In many cases, these DFA fragments can be merged with additional conditions im-
posed on the DFA transitions and appropriate set and reset operations of the history
flag. As reported in [10], such H-FAs can result in more than 80% space reduction
for most common pattern sets used in network intrusion detection systems.
10.8 Summary
References
11 Anomaly Detection Approaches for Communication Networks
Abstract In recent years, network anomaly detection has become an important area
for both commercial interests as well as academic research. Applications of anomaly
detection typically stem from the perspectives of network monitoring and network
security. In network monitoring, a service provider is often interested in capturing
such network characteristics as heavy flows, flow size distributions, and the number
of distinct flows. In network security, the interest lies in characterizing known or
unknown anomalous patterns of an attack or a virus.
In this chapter we review two main approaches to network anomaly detection:
streaming algorithms, and machine learning approaches with a focus on unsuper-
vised learning. We discuss the main features of the different approaches and discuss
their pros and cons. We conclude the chapter by presenting some open problems in
the area of network anomaly detection.
11.1 Introduction
Network anomaly detection has become an important research area for both com-
mercial interests as well as academic research. Applications of anomaly detection
methods typically stem from the perspectives of network monitoring and network
security. In network monitoring, a service provider is interested in capturing such
M. Thottan ()
Bell Labs, Alcatel-Lucent, 600-700 Mountain Avenue, Murray Hill, NJ 07974, USA
e-mail: [email protected]
G. Liu
Division of Mathematics and Sciences, Roane State Community College, 276 Patton Lane,
Harriman, TN 37748, USA
e-mail: [email protected]
C. Ji
School of Electrical and Computer Engineering, Georgia Institute of Technology, 777 Atlantic
Drive, Atlanta, GA 30332, USA
e-mail: [email protected]
network characteristics such as heavy flows that use a particular link with a given
capacity, flow size distributions, and the number of distinct flows in the network.
In network security, the interest lies in characterizing known or unknown anoma-
lous patterns of misbehaving applications or a network attack or even a virus in the
packet payload.
A general definition for a network anomaly describes an event that deviates
from normal network behavior. However, since known models for normal network
behavior are not readily available, it is difficult to develop an anomaly detector
in the strictest sense. Based on the inherent complexity in characterizing normal
network behavior, the problem of anomaly detection can be categorized as model-
based and non-model-based. In model-based anomaly detectors, it is assumed that
a known model is available for the normal behavior of certain specific aspects of
the network and any deviation from the norm is deemed an anomaly. For network
behaviors that cannot be characterized by a model, non-model-based approaches
are used. Non-model-based approaches can be further classified according to the
specific implementation and accuracy constraints that have been imposed on the
detector.
In network monitoring applications, where a statistical characterization of net-
work anomalies is required, a known model is not a must. Nevertheless, a statistical
anomaly detector must have access to large volumes of data that can provide the
required samples, from which one can accurately learn the normal network behav-
ior. However, with increasing speeds of network links, the frequency of sampling
that is necessary for achieving a desired accuracy may be infeasible to implement.
For example, on an OC-768 link, packets arrive every 25 ns. Online monitoring
of these packets requires per-packet processing along with a large amount of
state information that must be kept in memory. This is a heavy burden on the
limited memory (SRAM) that is available on the router line cards. The sampling
rates are therefore heavily resource-constrained. Under these circumstances it is
more appropriate to use anomaly detection that can process long streams of data
with small memory requirements and limited state information. Consequently, an
online detection of anomalies with processing resource constraints corresponds to
making some specific queries on the data and this is better handled by discrete al-
gorithms that can process streaming data. In comparison with statistical sampling,
streaming inspects every piece of data for the most important information while
sampling processes only a small percentage of the data and absorbs all the informa-
tion therein [56].
Machine learning approaches can be used when computational constraints are re-
laxed. Machine learning can be viewed as adaptively obtaining a mapping between
measurements and network states, normal or anomalous. The goal is to develop
learning algorithms that can extract pertinent information from measurements and
can adapt to unknown network conditions and/or unseen anomalies. In a broad
sense, statistical machine learning is one class of statistical approaches. The choice
of the learning scenario depends on the information available in the measurements.
For example, a frequent scenario is where there are only raw network measure-
ments available, and thus unsupervised learning methods are used. If additional
capacity of the link [17]. In the heavy-change detection problem, the goal is to
detect the set of flows that have a drastic change in traffic volume from one time
period to another [29].
In a separate work [59], Zhang et al. also consider the problem of HHH detection,
where the definition of HHHs is simplified such that counts from heavy-hitter items
are not excluded from their parents. The work in [59] establishes a connection between
multi-dimensional HHH detection problems and packet classification problems.
Furthermore, dynamic algorithms based on adaptive synopsis data structures are
shown to be efficient tools for HHH detection and can serve as promising building
blocks for network anomaly detection.
Closely related to the heavy-hitter problem is the heavy distinct-hitter problem,
which can be defined as in [52]: Given a stream of (x, y) pairs, the heavy distinct-
hitter problem is to find all the x's that are paired with a large number of distinct y's.
For instance, in the case of worm propagation, a compromised host may scan a large
number of distinct destinations in a short time. Such a compromised host can be con-
sidered as a heavy distinct-hitter. Please note that a heavy distinct-hitter might not be
a heavy hitter. For example, the compromised host in the worm propagation example
may only create a limited amount of network traffic and is not considered a heavy
hitter based on traffic volume. On the other hand, a heavy hitter may not be a heavy
distinct-hitter if it is only connected to a limited number of hosts even though its
traffic volume is significantly large. To detect heavy distinct-hitter is meaningful in
many network applications such as identification of Distributed Denial-Of-Service
(DDOS) attacks. In [52], Venkataraman et al. propose one-level and two-level filter-
ing algorithms for heavy distinct-hitter detection. Both theoretical and experimental
results show that these algorithms can provide accurate detection in a distributed
setting with limited memory requirements for data storage.
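For intuition, the following toy (not the one-level/two-level filters of [52]) uses hash-based sampling of (src, dst) pairs so that the number of sampled distinct destinations per source roughly tracks the true distinct count; all names, rates, and thresholds are assumptions.

import hashlib

# Hash-based sampling of (src, dst) pairs: the same pair always hashes to the
# same value, so repeated packets for one destination are not over-counted.
def h(src, dst):
    return int(hashlib.md5(f'{src}|{dst}'.encode()).hexdigest(), 16) / 2**128

SAMPLE_RATE = 0.05
sampled = {}                       # src -> set of sampled distinct destinations

def observe(src, dst):
    if h(src, dst) < SAMPLE_RATE:
        sampled.setdefault(src, set()).add(dst)

for dst in range(2000):            # a scanner touching many distinct hosts
    observe('scanner', dst)
for _ in range(2000):              # a heavy hitter talking to one host repeatedly
    observe('bulk-sender', 'server')

estimates = {s: len(d) / SAMPLE_RATE for s, d in sampled.items()}
print(estimates)                   # 'scanner' estimate near 2000; 'bulk-sender' tiny or absent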
There has been much progress in heavy-hitter problems as discussed above. Nev-
ertheless, as indicated in [29], heavy hitters are flows that represent a significantly
large proportion of the ongoing traffic or the capacity of the link, but they do not nec-
essarily correspond to flows experiencing significant changes. In terms of network
anomaly detection, heavy-change detection is usually more informative than heavy-
hitter detection. In addition, the solutions for heavy-hitter detection discussed here
usually use data structures that do not have the linearity property. Thus, it is difficult to
perform the more general task of query aggregations [45].
into discrete intervals, I_1, I_2, ..., and define the value of A[a_i] at time interval
k, k = 1, 2, ..., t, as s_{i,k}. The problem of heavy-change detection is to find those
items (keys) satisfying the following condition:

|s_{i,j} - s_{i,k}| > ε    or    |s_{i,j} - s_{i,k}| / max{|s_{i,k}|, 1} > γ,

where ε and γ are predetermined thresholds. Note that in this case changes can be
defined using different measures of differences such as absolute difference, relative
difference and so on [10].
Clearly heavy-change detection is a harder problem than heavy-hitter detection.
In heavy-change detection, the sketch method has shown great potential. The basic
idea is to summarize the input streams so that per-flow analysis can be avoided.
In [29], Krishnamurthy et al. first apply sketches to the heavy-change detection
problem. With sketch-based change detection, input data streams are summarized
using k-ary sketches. After the sketches are created, different time series forecast
models can be implemented on top of the summaries. Then the forecast errors are
used to identify whether there are significant changes in the data stream. The sketch-
based techniques use a small amount of memory and have constant per-record update
and reconstruction costs; thus they can be used for change detection in high-speed
networks with a large number of flows. However, the k-ary sketch-based change
detection has one main drawback: the k-ary sketch is irreversible, thus making it im-
possible to reconstruct the desired set of anomalous keys without querying every possible IP
address, or every address seen in the stream if the IP addresses have been saved.
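A rough flavor of sketch-based change detection can be given in a few lines; this toy uses a CountMin-style summary rather than the exact k-ary sketch of [29], and the sketch dimensions, thresholds, and key space are assumptions.

import random

# A toy sketch-based change detector: per-interval counts are summarized in
# H x K counters, and the change for a queried key is estimated from the
# difference between the two summaries. The sketch itself is not reversible,
# so only keys we can enumerate are tested.
H, K = 4, 1024
SEEDS = [random.Random(i).randrange(1 << 30) for i in range(H)]

def bucket(key, row):
    return hash((SEEDS[row], key)) % K

def empty():
    return [[0] * K for _ in range(H)]

def update(sketch, key, count=1):
    for row in range(H):
        sketch[row][bucket(key, row)] += count

def estimate(sketch, key):
    return min(sketch[row][bucket(key, row)] for row in range(H))

prev, curr = empty(), empty()
for _ in range(10000):
    update(prev, random.randrange(500))      # background traffic, interval t-1
    update(curr, random.randrange(500))      # background traffic, interval t
update(curr, 'attacker', 3000)               # one key changes drastically

suspects = [k for k in list(range(500)) + ['attacker']
            if estimate(curr, k) - estimate(prev, k) > 1000]
print(suspects)                              # expected: ['attacker']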
To address these problems, in [45, 46], Schweller et al. develop change detec-
tion schemes based on reversible sketch data structures. The basic idea is to hash
intelligently by modifying the input keys and hashing functions so that keys with
heavy changes can be recovered [46]. Using reverse hashing schemes, the authors
can efficiently identify the set of all anomalous keys in the sketch. In addition, the
authors introduce the bucket index matrix algorithm for accurate multiple heavy-
change detection. Empirical results show that the reverse hashing is capable of
detecting heavy changes and identifying the anomalous flows in real time. In [18],
Gao et al. extend the work in [29,45,46] by considering an optimal small set of met-
rics and building two-dimensional sketches for flow-level traffic anomaly detection.
The authors also implement a high-speed online intrusion detection system based
on two-dimensional sketches, which is shown to be capable of detecting multiple
types of attacks simultaneously with high accuracy.
In a separate work [10], Cormode et al. introduce the deltoid concept for heavy-
change detection, where a deltoid is defined as an item that has a large difference.
The authors propose a framework based on a structure of Combinational Group
Testing to find significant deltoids in high-speed networks. It is also shown that the
proposed algorithms are capable of finding significant deltoids with small memory
and update time, and with guaranteed pre-specified accuracy. As commented in [45],
deltoids can be considered as an expansion of k-ary sketch with multiple counters
for each bucket in the hash table at the cost of memory requirements for data storage.
In this section, we review statistical approaches for anomaly detection. Fig. 11.1
illustrates the general steps involved in statistical anomaly detection. The first step is
to preprocess or filter the given data inputs. This is an important step as the types of
data available and the timescales in which these data are measured can significantly
affect the detection performance [50]. In the second step, statistical analysis and/or
data transforms are performed to separate normal network behavior from anomalous
behavior and noise. A variety of statistical techniques can be applied here, e.g.,
wavelet analysis, covariance matrix analysis, and principal component analysis. The
main challenge is to find computationally efficient techniques for anomaly detection
with low false alarm rate. In the final step, decision theories such as Generalized
Likelihood Ratio (GLR) test can be used to determine whether there is a network
anomaly based on the deviations observed in the input data.
In a broader context, statistical anomaly detection can also be viewed from a
machine learning perspective, where the goal is to find the appropriate discriminant
function that can be used to classify any new input data vector into the normal or
anomalous region with good accuracy for anomaly detection. One subtle difference
between statistical anomaly detection and machine learning based methods is that
statistical approaches generally focus on statistical analysis of the collected data,
whereas machine learning methods focus on the “learning” aspect. Thus based on
the availability of models or on the strength of the assumptions that can be made
regarding the normal and anomalous models we can classify learning approaches
under two broad categories: model-based and non-model-based learning.
When the anomaly detection can safely assume a known model for the behavior of
normal and anomalous data we can employ model-based learning approaches. The
assumed models may be actual statistical models defined on the input data and/or
system specific information.
In the past the Kalman filter has been applied successfully to a wide variety of prob-
lems involving the estimation of dynamics of linear systems from incomplete data.
A Kalman filter generally consists of two steps: the prediction step and the estima-
tion step. In the prediction step, the state at time t C 1 is predicted based on all the
observed data up to time t. In the estimation step, the state at time t C 1 is estimated
by comparing the prediction from the previous step with the new observations.
In [47], Soule et al. develop a traffic anomaly detection scheme based on a
Kalman filter. The authors assume that per-link statistics on byte counts are eas-
ily available, but the traffic matrix that includes all pairs of origin-destination flows
are not directly observable. The authors process the link data using a Kalman filter
to predict the traffic matrix one step into the future. After the prediction is made,
the actual traffic matrix is estimated based on new link data. Then the difference
between the prediction and the actual traffic matrix is used to detect traffic volume
anomalies based on different thresholding methods. Kalman filter is a promising
tool for network anomaly detection together with other more complicated models
of non-linear dynamics.
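As a minimal illustration of the prediction/estimation loop, the following scalar Kalman filter flags large innovations (prediction errors) as volume anomalies; the random-walk state model, noise variances, and threshold are all assumptions of ours, and the full traffic-matrix setting of [47] is not reproduced.

import numpy as np

# A minimal scalar Kalman filter for volume anomaly detection: the hidden
# state is the "true" traffic level, observations are noisy byte counts, and
# a large normalized innovation is flagged as an anomaly.
def kalman_anomalies(obs, q=1.0, r=25.0, threshold=4.0):
    x, p = obs[0], 1.0                 # state estimate and its variance
    flags = []
    for z in obs[1:]:
        x_pred, p_pred = x, p + q      # prediction step (random-walk model)
        innov, s = z - x_pred, p_pred + r
        flags.append(abs(innov) / np.sqrt(s) > threshold)
        k = p_pred / s                 # estimation (update) step
        x, p = x_pred + k * innov, (1 - k) * p_pred
    return flags

traffic = [100 + np.random.normal(0, 5) for _ in range(200)]
traffic[150] += 80                     # injected volume anomaly
print([i + 1 for i, f in enumerate(kalman_anomalies(traffic)) if f])   # typically [150]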
When the anomaly detection problem is presented such that there is no clarity on the
type and specifics of a model that can be assumed, one can still use model-based
learning approaches. The methods used in this case allow for the uncertainty of the
model to co-exist in the detection methods used.
In [58], Yeung et al. develop a covariance matrix method to model and detect
flooding attacks. Each element in the covariance matrix corresponds to the corre-
lation between two monitored features at different sample sequences. The profile
of the normal traffic can then be described by the mathematical expectation of all
covariance matrices constructed from samples of the normal class in the training
dataset. Anomalies can be detected with threshold-based detection schemes. The
work in [58] uses second-order statistics of the monitored features for anomaly de-
tection and is independent of assumptions on prior data distribution.
In [49], the covariance matrix method is extended, where the signs in the
covariance matrices are used directly for anomaly detection. Detecting anomalies
by comparing the sign of the covariance matrices saves computation costs while
maintaining low false alarm rates. In a separate work [38], Mandjes et al. consider
anomaly detection in a voice over IP network based on the analysis of the variance
of byte counts. The authors derive a general formula for the variance of the cumu-
lative traffic over a fixed time interval, which can be used to determine the presence
of a load anomaly in the network.
By employing second-order features, covariance matrix analysis has been shown
to be a powerful anomaly detection method. One interesting direction in this area
is to find what variables best characterize network anomaly and improve detection
performance.
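A hedged sketch of the covariance-matrix idea follows: the normal profile is the average covariance matrix over training windows, and a window is flagged when its covariance deviates too much from the profile. The feature dimension, window size, distance measure, and threshold are assumptions of ours, not the construction of [58].

import numpy as np

# Normal profile = mean covariance matrix over training windows; a test window
# is flagged when its covariance matrix is far from the profile.
def window_cov(x):
    return np.cov(x, rowvar=False)     # features in columns, samples in rows

rng = np.random.default_rng(0)
normal_windows = [rng.normal(size=(100, 3)) for _ in range(50)]
profile = np.mean([window_cov(w) for w in normal_windows], axis=0)

def is_anomalous(window, threshold=1.0):
    return np.linalg.norm(window_cov(window) - profile, 'fro') > threshold

test_normal = rng.normal(size=(100, 3))
flood = rng.normal(size=(100, 3))
flood[:, 0] = 3 * flood[:, 1] + rng.normal(scale=0.1, size=100)  # correlated surge
print(is_anomalous(test_normal), is_anomalous(flood))            # False True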
Wavelet analysis has been applied to modeling non-stationary data series because
it can characterize the scaling properties both in the temporal and frequency do-
mains. Wavelet analysis generally consists of two steps [5, 27]: decomposition and
reconstruction. The goal of the decomposition process is to extract from the orig-
inal signal a hierarchy of component signals. The reconstruction process aims to
perform the inverse of the decomposition and recapture the original signal. When
wavelet analysis is applied to network anomaly detection, a decomposition process
can be performed first on different input data such as IP packet headers. Afterwards,
the decomposed signals can be reconstructed across different timescales, which are
then used for anomaly detection. In the reconstruction stage, some signals derived
from the decomposition stage may be suppressed so that one can focus on signals of
interest. For instance, in anomaly detection of network traffic, one may choose to ig-
nore the day and night variation in the traffic volume and suppress the corresponding
signals in the reconstruction stage. One advantage of the wavelet approach is that by
constructing the wavelet basis at different timescales, the signatures of anomalies at
different timescales are preserved. The success of the wavelet analysis techniques
depends on selecting a suitable wavelet transform for a given application [5].
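The decompose/suppress/reconstruct idea can be illustrated with a hand-rolled Haar transform (deliberately avoiding any particular wavelet library and not reproducing the systems of [5, 27]); the synthetic signal, spike, and threshold are assumptions.

import numpy as np

# Hand-rolled Haar decomposition: the slowly varying approximation (e.g., the
# diurnal trend) is separated from fine-scale detail, and the detail signal is
# thresholded to expose a short-lived anomaly.
def haar_decompose(x, levels):
    approx, details = np.asarray(x, dtype=float), []
    for _ in range(levels):
        a = (approx[0::2] + approx[1::2]) / 2
        d = (approx[0::2] - approx[1::2]) / 2
        details.append(d)
        approx = a
    return approx, details

signal = 100 + 20 * np.sin(np.linspace(0, 4 * np.pi, 256))   # slow daily variation
signal[131] += 60                                            # short traffic spike

approx, details = haar_decompose(signal, levels=3)
fine = details[0]                      # finest-scale detail coefficients
spikes = np.where(np.abs(fine) > 5 * np.std(fine[:32]))[0]
print(spikes * 2)                      # position near 130, the injected anomaly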
In [41], Miller and Willsky apply wavelet transform techniques to anomaly de-
tection in geophysical prospecting. Although in a different setting, the anomaly
detection problem considered in [41] has some similarities to the problem of net-
work anomaly detection. In both cases, one seeks to detect anomalous situations
using limited and/or partial measurements. For example, in geophysical prospect-
ing, the data available for anomaly detection are usually scattered radiation collected
at medium boundaries [41], while in network anomaly detection, limited data is a
common scenario. In addition, there is the stringent requirement for computationally
efficient anomaly detection with low alarm rate in both cases.
In [5], Barford et al. successfully apply wavelet techniques to network traffic
anomaly detection. The authors develop a wavelet system that can effectively iso-
late both short- and long-lived traffic anomalies. The wavelet analysis in [5] mainly
focuses on aggregated traffic data in network flows. In [27], Kim and Reddy extend
the work in [5] by studying IP packet header data at an egress router through wavelet
analysis for traffic anomaly detection. Their approach is motivated by the observa-
tion that the out-bound traffic from an administrative domain is likely to exhibit
strong correlation with itself over time. Thus, in [27], the authors study the cor-
relation among addresses and port numbers over multiple timescales with discrete
wavelet transforms. Traffic anomalies are detected if historical thresholds are ex-
ceeded in the analyzed signal.
Wavelet analysis is an effective anomaly detection method when dealing with ag-
gregate flow data. However, when applied to modeling the time series corresponding
to volume counts in a time window, Soule et al. [47] find that wavelet-based meth-
ods do not perform as well compared to the simple Generalized Likelihood Ratio
(GLR) test method. The GLR method is used to test the likelihood of an occurrence
of an anomaly by comparing the effectiveness of an assumed traffic model over two
time windows of input data [50].
Statistical learning is in the general context of statistical approaches but with greater
emphasis on techniques that adapt to the measured data. In particular, statistical ma-
chine learning approaches perform anomaly detection and continue to adapt to new
measurements, changing network conditions, and unseen anomalies. A general for-
mulation of this type of network anomaly detection algorithms is as follows:
Let X(t) ∈ R^n (X in short) be an n-dimensional random feature vector drawn from
a distribution at time t ∈ [0, t]. Consider the simplest scenario that there are two
underlying states of a network, ω_i, i = 0, 1, where ω_0 corresponds to normal net-
work operation, and ω_1 corresponds to an “unusual or anomalous” network state.
Anomaly detection can be considered as determining whether a given observation x
of the random feature vector X is a symptom of an underlying network state ω_0 or
ω_1. That is, a mapping needs to be obtained between X and ω_i for i = 0, 1. Note
that an anomalous network state may or may not correspond to an abnormal network
operation. For example, a flash crowd, which is a surge of user activities, may result
from users' legal requests for new software or from DDOS or Worm attacks.
Due to the complexity of networks, such a mapping is usually unknown but can
be learned from measurements. Assume that a set D of m measurements is collected
from a network as observations on X, i.e., D = {x_i(t)}_{i=1}^m, where x_i(t) is the i-th
observation for t ∈ [0, t]. x_i(t) is called a training sample in machine learning. In
addition, another set D_l = {y_q}_{q=1}^k of k measurements is assumed to be available in
general that are samples (y_q = 0, 1) on the ω_i's. The y_q's are called labels in machine learn-
ing. A pair (x_i(t), y_i) is called a labeled measurement, where observation x_i(t) is
obtained when a network is in a known state. For example, if measurement x_i(t) is
taken when the network is known to operate normally, y_i = 0 and (x_i(t), 0) is con-
sidered as a sample for normal network operation. If x_i(t) is taken when the network
is known to operate abnormally, y_i = 1 and (x_i(t), 1) is considered as a “signature”
in anomaly detection. In general x_i(t) is considered as an unlabeled measurement,
meaning the observation x_i(t) occurs when the network state is unknown. Hence,
network measurements are of three types:
(a) normal data D_n = {(x_i(t), 0)}_{i=1}^{k_u},
(b) unlabeled data D = {x_j(t)}_{j=1}^m, and
(c) anomalous data D_l = {(x_r(t), 1)}_{r=1}^{u}.
A training set consists of all three types of data in general although a frequent
scenario is that only D and Dn are available. Examples of D include raw measure-
ments on end-to-end flows, packet traces, and data from Management Information
Base (MIB) variables. Examples of Dn and Dl can be such measurements obtained
under normal or anomalous network conditions, respectively.
Given a set of training samples, a machine learning view of anomaly detection is
to learn a mapping f(·) using the training set, where
f(·): Network Measurements → ω_i,
so that a desired performance can be achieved on assigning a new sample x to one
of the two categories. Figure 11.2 illustrates the learning problem.
A training set determines the amount of information available, and thus cate-
gorizes different types of learning algorithms for anomaly detection. Specifically,
when D and Dn are available, learning/anomaly detection can be viewed as un-
supervised. When D, Dl , and Dn are all available, learning/anomaly detection
becomes supervised, since we have labels or signatures. f(·) determines the ar-
chitecture of a “learning machine”, which can either be a model with an analytical
expression or a computational algorithm. When f(·) is a model, learning algorithms
for anomaly detection are parametric, i.e., model-based; otherwise, learning algo-
rithms are non-parametric, i.e., non-model-based.
Fig. 11.2 The learning problem: a training set is used to learn the mapping f(·) from measurements to network states ω_i, subject to a performance criterion
Early work [23, 40, 50] in this area begins with a small network, e.g., interface ports
at a router, and chooses aggregated measurements, e.g., Management Information
Base (MIB) variables that are readily available from a network equipment. Mea-
surements are made within a moving window across a chosen time duration [23,50].
Both off-line and online learning have been considered, and simple algorithms have
been evaluated on the measured data. For example, in [23, 50], the authors select
model-based learning, where X(t) from normal operations is modeled as a second-
order AR process, i.e., f(X) = a_0 + a_1 X + a_2 X^2 + ε, where a_i, i = 0, 1, 2, are parameters
learned using measurements collected from a moving window of a chosen time
duration, and ε is assumed to be Gaussian residual noise. A likelihood ratio test is
applied. If the sample variance of the residual noise exceeds a chosen threshold,
an observation x is classified as an anomaly. The model has been successful in
detecting a wide range of anomalies such as flash crowds, congestion, broadcast
storms [50], and worm attacks [12]. In addition, these anomalies have been detected
proactively before they cause catastrophic network failures [23, 50]. One disadvan-
tage of such an approach is the difficulty of choosing an appropriate timescale where
AR processes can accurately characterize normal behavior of the MIB variables.
The AR model with Gaussian noise is also not applicable to more complex tempo-
ral characteristics, e.g., bursty and non-Gaussian variables.
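A rough sketch of the AR-residual test described above, with an ordinary least-squares fit and a simple variance threshold standing in for the GLR machinery of [23, 50]; the window lengths and the threshold factor are assumptions of ours.

import numpy as np

# Fit a second-order AR model to a window of a MIB-like counter and flag
# windows whose residual variance jumps relative to a normal baseline.
def ar2_residual_var(x):
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones(len(x) - 2), x[1:-1], x[:-2]])  # lag-1, lag-2
    coef, *_ = np.linalg.lstsq(X, x[2:], rcond=None)
    resid = x[2:] - X @ coef
    return resid.var()

rng = np.random.default_rng(1)
normal = [ar2_residual_var(rng.normal(size=60).cumsum()) for _ in range(30)]
baseline = np.mean(normal)

suspect = rng.normal(size=60).cumsum()
suspect[40:] += 50                          # abrupt level shift (e.g., broadcast storm)
print(ar2_residual_var(suspect) > 5 * baseline)   # True: flagged as anomalous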
More recently a broader class of raw measurements has been selected for
anomaly detection. For example, IP packet traces are used to detect faults/anomalies
related to load changes (see [5] and references therein). The number of TCP SYN
and FIN (RST) per unit time has been used to detect DDOS attacks [55]. Route
update measurements have been used to understand routing anomalies and/or BGP
routing instability [25, 60, 61]. Aggregated rates are used to detect bottlenecks in
3G networks [42]. These measurements are of different types but are all aggregated
data (counts) rather than individual packets. Aggregated measurements facilitate
scalable detection but do not contain detailed information, such as in packet headers
that are available from end-to-end flow-based measurements.
More general algorithms have also been developed that go beyond simple mod-
els. One type of algorithm aims at better characterizing temporal characteristics in
measurements. For example, wavelets are applied to BGP routing updates in [60],
and four other types of measurements (outages, flash crowds, attacks, and measure-
ment errors) in [5]. In this case, wavelets can better characterize anomalies across
different timescales. Another example of this type of algorithm is non-parametric
11.3.3.1.2 Clustering
Bayesian Belief Networks are used to capture the statistical dependence or causal-
relations that map a set of input variables to a set of anomalies.
In an early work [23], Hood and Ji applied Bayesian Belief Networks to MIB
variables in proactive network fault detection. The premise is that many variables
in a network may exhibit anomalous behavior upon the occurrence of an event, and
can be combined to provide a network-wide view of anomalies that may be more
robust and result in more accurate detection. Specifically, the Bayesian Belief Net-
work [23] first combines MIB variables within a protocol layer, and then aggregates
intermediate variables of protocol layers to form a network-wide view of anomalies.
Combinations are done through conditional probabilities in the Bayesian Belief Net-
work. The conditional probabilities were determined a priori.
In a recent work [28], Kline et al. use Bayesian Belief Networks to combine and
correlate different types of measurements such as traffic volumes, ingress/egress
packets, and bit rates. Parameters in the Bayesian Belief Network are learned using
measurements. The performance of the Bayesian Belief Network compares favor-
ably with that of wavelet models and time-series detections, especially in lowering
false alarm rates [28].
In [26], Ide and Kashima provide an example of Bayesian Networks in unsu-
pervised anomaly detection at the application layer. In this work a node in a graph
represents a service and an edge represents a dependency between services. The
edge weights vary with time, and are estimated using measurements. Anomaly de-
tection is conducted from a time sequence of graphs with different link weights. The
paper shows that service anomalies are detected with little information on normal
network behavior.
Hidden Markov Models are related to Belief Networks and other probabilistic
graphical models in which certain internal states are unobservable. In [57], Yang
et al. use a Hidden Markov Model to correlate observation sequences and state
transitions so that the most probable intrusion sequences can be predicted. The ap-
proach is applied to intrusion detection datasets and shown to reduce false alarm
rates effectively.
Belief networks are also related to representations of rules. In [34], Lee et al.
describe a data mining framework for building intrusion detection models. The
approach first extracts an extensive set of features that describe each network con-
nection or host session using the audit data. Several types of algorithms such as
classification, link analysis, and sequence analysis are then used to learn rules that
can accurately capture intrusions and normal activities. To facilitate adaptability and
extensibility, meta-learning is used as a means to construct an integrated model that
can incorporate evidence from multiple models. Experiments show that frequent
patterns mined from audit data can serve as reliable anomaly detection models.
In [25], based on measurements from an operational backbone network, the authors find that nearly 80% of the network disruptions exhibit some
level of correlation across multiple routers in the network. Then Huang et al. ap-
ply PCA analysis techniques to the BGP updates and successfully detect all node
and link failures and two-thirds of the failures on the network periphery. The work
in [25] also demonstrates that it is possible to combine the analysis of routing
dynamics with static configuration analysis for network fault localization. Thus
network-wide analysis techniques could be applied to online anomaly detection.
However, as indicated in [25], one remaining open issue is to understand what infor-
mation best enables network diagnosis and to understand the fundamental tradeoffs
between the information available and the corresponding performance.
Although PCA-based approaches have been shown to be an effective method for
network anomaly detection, in [43], Ringberg et al. point out the practical difficulty
in tuning the parameters of the PCA-based network anomaly detector. In [43], the
authors perform a detailed study of the feature time series for detected anomalies
in two IP backbone networks (Abilene and Geant). Their investigation shows that
the false positive rate of the detector is sensitive to small differences in the number
of principal components in the normal subspace and the effectiveness of PCA is
sensitive to the level of aggregation of the traffic measurements. Furthermore, a
large anomaly may contaminate the normal subspace, thus increasing the false alarm
rate. An important open issue, therefore, is to develop PCA-based anomaly detection
techniques that are easy to tune and robust in practice.
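As a rough illustration of the subspace idea discussed above (in the spirit of [31, 43], not their exact procedure), the sketch below removes the top-k principal components of a link-load matrix and flags timestamps whose squared prediction error in the remaining residual subspace is large; the number of components k and the threshold are precisely the tuning knobs whose sensitivity Ringberg et al. highlight, and both values here are illustrative.

import numpy as np

def pca_spe_detector(X, k=4, alpha=5.0):
    # X: (timesteps x links) traffic matrix.  The top-k principal components
    # span the 'normal' subspace; rows with a large squared prediction error
    # (SPE) in the residual subspace are flagged.  k and alpha are illustrative.
    Xc = X - X.mean(axis=0)                      # center each link's time series
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                                 # (links x k) normal-subspace basis
    residual = Xc - Xc @ P @ P.T                 # projection onto the residual subspace
    spe = (residual ** 2).sum(axis=1)            # squared prediction error per timestep
    cutoff = spe.mean() + alpha * spe.std()      # crude stand-in for the Q-statistic
    return spe > cutoff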
From the above study of the different anomaly detection approaches that are avail-
able today, it is clear that a black box anomaly detector may indeed be a utopian
dream [53] for two main reasons: (1) the nature of the information that is fed to
the anomaly detector could be varied both in format and range, and (2) the nature of
the anomaly, its frequency of occurrence, and resource constraints clearly dictate the
detection method of choice. In [53] the authors propose an initial prototype anomaly
detector that transforms the input data into some common format before choosing
the appropriate detection methodology. This is clearly an area where further re-
search can make an important contribution, especially for deployment in service
provider environments, where it is necessary to build multiple anomaly detectors to address
the myriad monitoring requirements.
One of the challenges encountered when employing machine learning or statistical
approaches is the multiple timescales on which different network events of interest
occur. Capturing the characteristics of multi-time-scale
anomalies is difficult since the timescale of interest could be different for different
anomaly types and also within an anomaly type depending on the network condi-
tions. In [40], Maxion and Tan describe the influence of the regularity of data on the
performance of a probabilistic detector. It was observed that false alarms increase
as a function of the regularity of the data. The authors also show that the regularity
of the data is not merely a function of user type or environments but also differs
within user sessions and among users. Designing anomaly detectors that can adapt
to the changing nature of input data is an extremely challenging task. Most anomaly
detectors employed today are affected by inherent changes in the structure of the
input data, and these changes in turn degrade performance parameters such as the
probabilities of hits and misses, and false alarm rates.
Sampling strategies for multi-time-scale events under resource constraints are an-
other area where improved scientific understanding is needed to aid the design of
anomaly detection modules. In [37], the authors discovered that most
sampling methods employed today introduce significant bias into measured data,
thus possibly deteriorating the effectiveness of the anomaly detection. Specifically,
Mai et al. use packet traces obtained from a Tier-1 IP-backbone using four sam-
pling methods including random and smart sampling. The sampled data is then used
to detect volume anomalies and port scans in different algorithms such as wavelet
models and hypothesis testing. Significant bias is discovered in these commonly
used sampling techniques, suggesting possible bias in anomaly detection.
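The following toy simulation (illustrative only, not the methodology of [37]) shows one source of such bias: under uniform packet sampling at a rate of 1 in 100, flows of only a few packets are usually missed entirely, so the features on which a port-scan detector relies, such as the number of distinct small flows, are badly distorted; the flow-size mix and sampling rate below are assumptions chosen for the example.

import numpy as np

rng = np.random.default_rng(0)

# synthetic traffic: many small (scan-like) flows plus a few large flows
flow_sizes = np.concatenate([rng.integers(1, 4, size=5000),       # 1-3 packet flows
                             rng.integers(1000, 5000, size=50)])  # large flows
sampling_rate = 1.0 / 100                                         # sample 1 in 100 packets

# each flow contributes Binomial(size, rate) sampled packets
sampled = rng.binomial(flow_sizes, sampling_rate)

true_small = int((flow_sizes < 4).sum())
seen_small = int(((flow_sizes < 4) & (sampled > 0)).sum())        # flows still visible
print(f"small flows present: {true_small}, visible after sampling: {seen_small}")
# most small flows vanish, so a detector run on sampled data sees far fewer scan sources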
Often, the detection of network anomalies requires the correlation of events
across multiple correlated input datasets. Using statistical approaches, it is challeng-
ing to capture the dependencies observed in the raw data. Streaming algorithms,
likewise, cannot capture these statistical dependencies unless rule-based engines
are available to correlate or couple queries from multiple streaming algorithms.
Despite the challenges, the representation of these dependen-
cies across multiple input data streams is necessary for the detailed diagnosis of
network anomalies.
To sum up, there still remain several open issues to improve the efficiency and
feasibility of anomaly detection. One of the most urgent issues is to understand
what information can best facilitate network anomaly detection. A second issue is
to investigate the fundamental tradeoffs between the amount/complexity of infor-
mation available and the detection performance, so that computationally efficient
real-time anomaly detection is feasible in practice. Another interesting problem is
to systematically investigate each anomaly detection method and understand when
and in what problem domains these methods perform well.
References
1. Ahmed T., Coates M., Lakhina A.: Multivariate Online Anomaly Detection Using Kernel
Recursive Least Squares. Proc. of 26th IEEE International Conference on Computer Com-
munications (2007)
2. Ahmed T., Oreshkin B., Coates M.: Machine Learning Approaches to Network Anomaly De-
tection. Proc. of International Measurement Conference (2007)
3. Andersen D., Feamster N., Bauer S., Balakrishnan H.: Topology inference from BGP routing
dynamics. Proc. of ACM SIGCOMM Internet Measurement Workshop, Marseille, France (2002)
4. Androulidakis G., Papavassiliou S.: Improving Network Anomaly Detection via Selective
Flow-Based Sampling. Communications, IET. Vol. 2, no. 3, 399–409 (2008)
5. Barford P., Kline J., Plonka D., Ron A.: A Signal Analysis of Network Traffic Anomalies. Proc.
of the 2nd ACM SIGCOMM Workshop on Internet Measurements, 71–82 (2002)
6. Cormode G., Korn F., Muthukrishnan S., Srivastava D.: Finding Hierarchical Heavy Hitters
in Data Streams. Proc. of VLDB, Berlin, Germany (2003)
7. Cormode G., Muthukrishnan S.: Improved Data Stream Summaries: The Count-Min Sketch and
Its Applications. Tech. Rep. 03-20, DIMACS (2003)
8. Cormode G., Johnson T., Korn F., Muthukrishnan S., Spatscheck O., Srivastava D.: Holistic
UDAFs at Streaming Speeds. Proc. of ACM SIGMOD, Paris, France (2004)
9. Cormode G., Korn F., Muthukrishnan S., Srivastava D.: Diamond in the Rough: Finding Hier-
archical Heavy Hitters in Multi-Dimensional Data. Proc. of ACM SIGMOD, 155–166 (2004)
10. Cormode G., Muthukrishnan S.: What’s New: Finding Significant Differences in Network Data
Streams. IEEE/ACM Trans. Netw. 13(6):1219–1232 (2005)
11. Cormode G., Korn F., Muthukrishnan S., Srivastava D.: Finding Hierarchical Heavy Hitters in
Streaming Data. ACM Trans. Knowledge Discovery from Data 1(4) (2008)
12. Deshpande S., Thottan M., Sikdar B.: Early Detection of BGP Instabilities Resulting From
Internet Worm Attacks. Proc. of IEEE Globecom, Dallas, TX (2004)
13. Duda R. O., Hart P., Stork D.: Pattern Classification, 2nd edn. John Wiley and Sons (2001)
14. Duffield N.G., Lund C., Thorup M.: Properties and Prediction of Flow Statistics from Sampled
Packet Streams. Proc. of ACM SIGCOMM Internet Measurement Workshop (2002)
15. Ensafi R., Dehghanzadeh S., Mohammad R., Akbarzadeh T.: Optimizing Fuzzy K-Means for
Network Anomaly Detection Using PSO. Computer Systems and Applications, IEEE/ACS
International Conference, 686–693 (2008)
16. Erjongmanee S., Ji C.: Inferring Internet Service Disruptions upon A Natural Disaster. To ap-
pear at 2nd International Workshop on Knowledge Discovery from Sensor Data (2008)
17. Estan C., Varghese G.: New Directions in Traffic Measurement and Accounting. Proc. of ACM
SIGCOMM, New York, USA (2002)
18. Gao Y., Li Z., Chen Y.: A DoS Resilient Flow-level Intrusion Detection Approach for High-
speed Networks, Proc. of IEEE International Conference on Distributed Computing Systems
(2006)
19. Gu Y., McCallum A., Towsley D.: Detecting Anomalies in Network Traffic Using Maximum
Entropy Estimation. Proc. of IMC (2005)
20. Haffner P., Sen S., Spatscheck O., Wang D.: ACAS: Automated Construction of Applica-
tion Signatures. Proc. of ACM SIGCOMM Workshop on Mining Network Data, Philadelphia,
(2005)
21. Hajji H.: Statistical Analysis of Network Traffic for Adaptive Faults Detection. IEEE Trans.
Neural Networks. Vol. 16, no. 5, 1053–1063 (2005)
22. He Q., Shayman M.A.: Using Reinforcement Learning for Pro-Active Network Fault Manage-
ment. Proc. of Communication Technology. Vol. 1, 515–521 (2000)
23. Hood C.S., Ji C.: Proactive Network Fault Detection. IEEE Trans. Reliability, Vol. 46, No. 3, 333–
341 (1997)
24. Huang L., Nguyen X., Garofalakis M., Jordan M.I., Joseph A., Taft N.: Communication-
Efficient Online Detection of Network-Wide Anomalies. Proc. of 26th Annual IEEE Confer-
ence on Computer Communications (2007)
25. Huang Y., Feamster N., Lakhina A., Xu J.: Diagnosing Network Disruptions with Network-
Wide Analysis. Proc. of ACM SIGMETRICS (2007)
26. Ide T., Kashima H.: Eigenspace-Based Anomaly Detection in Computer Systems. Proc. of
the tenth ACM SIGKDD international conference on Knowledge discovery and data mining,
Seattle, 440–449 (2004)
27. Kim S.S., Reddy A.: Statistical Techniques for Detecting Traffic Anomalies Through Packet
Header Data. Accepted by IEEE/ACM Trans. Networking (2008)
28. Kline K., Nam S., Barford P., Plonka D., Ron A.: Traffic Anomaly Detection at Fine Time
Scales with Bayes Nets. To appear in the International Conference on Internet Monitoring and
Protection (2008)
29. Krishnamurthy B., Sen S., Zhang Y., Chan Y.: Sketch-Based Change Detection: Methods, Eval-
uation, and Applications. Proc. of ACM SIGCOMM IMC, Florida, USA (2003)
30. Lall S., Sekar V., Ogihara M., Xu J., Zhang H.: Data Streaming Algorithms for Estimating
Entropy of Network Traffic. Proc. of ACM SIGMETRICS (2006)
31. Lakhina A., Crovella M., Diot C.: Diagnosing Network-Wide Traffic Anomalies. Proc. of ACM
SIGCOMM (2004)
32. Lakhina A., Papagiannaki K., Crovella M., Diot C., Kolaczyk E. N., Taft N.: Structural Anal-
ysis of Network Traffic Flows. Proc. of ACM SIGMETRICS (2004)
33. Lakhina A., Crovella M., Diot C.: Mining Anomalies Using Traffic Feature Distributions. Proc.
of ACM SIGCOMM, Philadelphia, PA (2005)
34. Lee W., Stolfo F., Mok K.W.: A Data Mining Framework for Building Intrusion Detection
Models. Proc. of IEEE Symposium on Security and Privacy (1999)
35. Lee W., Xiang D.: Information-Theoretic Measures for Anomaly Detection. Proc. of IEEE
Symposium on Security and Privacy (2001)
36. Leland W. E., Taqqu M. S., Willinger W., Wilson D. V.: On the Self-Similar Nature of Ethernet
Traffic, Proc. of ACM SIGCOMM (1993)
37. Mai J., Chuah C., Sridharan A., Ye T., Zang H.: Is Sampled Data Sufficient for Anomaly De-
tection? Proc. of 6th ACM SIGCOMM conference on Internet measurement, Rio de Janeiro,
Brazil. 165–176 (2006)
38. Mandjes M., Saniee I., Stolyar A. L.: Load Characterization and Anomaly Detection for Voice
over IP traffic. IEEE Trans. Neural Networks, Vol. 16, no. 5, 1019–1026 (2005)
39. Manku G. S., Motwani R.: Approximate Frequency Counts over Data Streams. Proc. of IEEE
VLDB, Hong Kong, China (2002)
40. Maxion R. A., Tan K. M. C.: Benchmarking Anomaly-Based Detection Systems. Proc. Inter-
national Conference on Dependable Systems and Networks (2000)
41. Miller E. L., Willsky A. S.: Multiscale, Statistical Anomaly Detection Analysis and Algorithms
for Linearized Inverse Scattering Problems. Multidimensional Systems and Signal Processing.
Vol. 8, 151–184 (1997)
42. Ricciato F., Fleischer W.: Bottleneck Detection via Aggregate Rate Analysis: A Real Case in a
3G Network. Proc. IEEE/IFIP NOMS (2004)
43. Ringberg H., Soule A., Rexford J., Diot C.: Sensitivity of PCA for Traffic Anomaly Detection.
Proc. of ACM SIGMETRICS (2007)
44. Rish I., Brodie M., Sheng M., Odintsova N., Beygelzimer A., Grabarnik G., Hernandez K.:
Adaptive Diagnosis in Distributed Systems. IEEE Trans. Neural Networks, Vol. 16, No. 5,
1088–1109 (2005)
45. Schweller R., Gupta A., Parsons E., Chen Y.: Reversible Sketches for Efficient and Accurate
Change Detection over Network Data Streams. Proc. of IMC, Italy (2004)
46. Schweller R., Li Z., Chen Y., Gao Y., Gupta A., Zhang Y., Dinda P., Kao M., Memik G.: Re-
verse hashing for High-Speed Network Monitoring: Algorithms, Evaluation, and Applications.
Proc. of IEEE INFOCOM (2006)
47. Soule A., Salamatian K., Taft N.: Combining Filtering and Statistical Methods for Anomaly
Detection. Proc. of IMC Workshop (2005)
48. Steinder M., Sethi A.S.: Probabilistic Fault Localization in Communication Systems Using
Belief Networks. IEEE/ACM Trans. Networking. Vol. 12, No. 5, 809–822 (2004)
49. Tavallaee M., Lu W., Iqbal S. A., Ghorbani A.: A Novel Covariance Matrix Based Approach for
Detecting Network Anomalies. Communication Networks and Services Research Conference
(2008)
50. Thottan M., Ji C.: Anomaly Detection in IP Networks. IEEE Trans. Signal Processing, Special
Issue of Signal Processing in Networking, Vol. 51, No. 8, 2191–2204 (2003)
51. Thottan M., Ji C.: Proactive Anomaly Detection Using Distributed Intelligent Agents. IEEE
Network. Vol. 12, no. 5, 21–27 (1998)
52. Venkataraman S., Song D., Gibbons P., Blum A.: New Streaming Algorithms for Fast Detection
of Superspreaders. Proc. of Network and Distributed Systems Security Symposium (2005)
53. Venkataraman S., Caballero J., Song D., Blum A., Yates J.: Black-box Anomaly Detection: Is
it Utopian? Proc. of the Fifth Workshop on Hot Topics in Networking (HotNets-V), Irvine,
CA (2006)
54. Xie Y., Kim H.A., O’Hallaron D. R., Reiter M. K., Zhang H.: Seurat: A Pointillist Approach
to Anomaly Detection. Proc. of the International Symposium on Recent Advances in Intrusion
Detection (RAID) (2004)
55. Wang H., Zhang D., Shin K. G.: Detecting SYN flooding attacks. Proc. of IEEE INFOCOM
(2002)
56. Xu J.: Tutorial on Network Data Streaming. SIGMETRICS (2007)
57. Yang Y., Deng F., Yang H.: An Unsupervised Anomaly Detection Approach using Subtractive
Clustering and Hidden Markov Model. Communications and Networking in China. 313–316
(2007)
58. Yeung D. S., Jin S., Wang X.: Covariance-Matrix Modeling and Detecting Various Flooding
Attacks. IEEE Trans. Systems, Man and Cybernetics, Part A, vol. 37, no. 2, 157–169 (2007)
59. Zhang Y., Singh S., Sen S., Duffield N., Lund C.: Online Identification of Hierarchical Heavy
Hitters: Algorithms, Evaluation and Applications. Proc. of ACM SIGCOMM conference on
Internet measurement. 101–114 (2004)
60. Zhang J., Rexford J., Feigenbaum J.: Learning-Based Anomaly Detection in BGP Updates.
Proc. of ACM SIGCOMM MineNet workshop (2005)
61. Zhang Y., Ge Z., Greenberg A., Roughan M.: Network Anomography. Proc. of ACM/USENIX
Internet Measurement Conference (2005)
Chapter 12
Model-Based Anomaly Detection
for a Transparent Optical Transmission System
Thomas Bengtsson, Todd Salamon, Tin Kam Ho, and Christopher A. White
12.1 Introduction
Raman amplification [7, 10, 13] uses the transmission (outside plant) fiber as the
amplification medium. High-powered pumps can be coupled to the transmission
fiber in both the forward (co-propagating) and backward (counter-propagating) di-
rections (Figure 12.1). Energy from the pumps is transferred to lower-frequency
signal channels following a characteristic gain function (Figure 12.2). The technol-
ogy offers significant benefits for increasing network transparency due to improved
noise characteristics associated with distributed amplification within the fiber span
and the ability to provide gain flattening (ripple control) through the use of multiple
Raman amplifier pumps (Figure 12.2).
Fig. 12.1 Schematic diagram of a Raman amplifier using both forward and backward pumping
Fig. 12.2 Schematic diagram of Raman gain. Variable settings of pump levels are associated with
variable gain at down-shifted frequencies. Pump levels are controlled to achieve a flat gain across
the signal frequencies
To effectively assess the health of a Raman amplification system, measurements
of signal and pump powers at different physical locations must be jointly collected
and analyzed. Joint analysis of spatially non-local measurements can also provide
better understanding of the propagation of potentially anomalous events. As men-
tioned, this can be especially useful when networks become more transparent, where
greater reach, enhanced switching, and increased automation may result in anoma-
lous effects which propagate far before being detected and corrected. As an example,
consider the case of an amplifier with a hardware or software anomaly that results
in undesirable noise amplification in an unused portion of the signal spectrum. If
uncorrected, this can result in preferential amplification of the noise relative to the
signals at each subsequent downstream amplifier node, and may eventually lead to
a loss of signal several nodes away from the original fault. One key challenge in
the analysis of such complex, interconnected networks is to break the system into
spatially independent components.
In the remainder of this chapter we describe a novel, model-based approach for
anomaly detection and monitoring within Raman-amplified systems. The proposed
methodology assesses if the signal gain across a fiber span is consistent with the
amplification level of the pumps being input to the span. Our approach combines
models of the Raman gain physics with statistical estimation, and results in several
methods that allow precise statements to be made regarding the state of the network,
including detection of anomalous conditions. The methods focus on detecting net-
work behavior that is inconsistent with Raman physics, and hence may capture more
complex anomalies not readily recognized by an engineering-rule-based approach.
Evaluating this difference requires knowledge of both the a priori conditional in-
formation set and the conditional probability distribution. In classical control theory
(e.g., Kalman filter), the distribution of the residual measure in (12.1) can often be
derived analytically from first principles, and can thus be used directly to determine
if an observation is congruent with the past and the underlying physical system.
However, in our setting, because the network operates across several physical and
software layers, explicit knowledge of both the relevant information set and the con-
ditional probability distribution represents a substantial challenge.
One way to estimate the variability for a chosen anomaly measure is to generate
the range of values using simulation, and then empirically model the distribution of
the measure using statistical techniques. In the remainder of this section, we describe
one such analysis that focuses on a particular measure known as the “ripple” of the
gain profile.
With y_i and y_i^* being the observed and targeted channel powers at frequency i, we
define the observed ripple at a spatial location s, time t, and span configuration θ by

obs_ripple(s, t, θ) = max_i {y(s, t, θ; i) − y^*(s, t, θ; i)} − min_i {y(s, t, θ; i) − y^*(s, t, θ; i)}.   (12.2)

The parameter θ represents a set of system parameters that fully define the con-
figuration of the fiber span, including the fiber type, fiber length, connector losses,
pump specifications, channel loading, etc. For channels with a flat amplification tar-
get, i.e., y_i^* ≡ y^*, ∀i, the observed ripple is equivalent to the range of channel
powers, i.e.,

obs_ripple(s, t, θ) = max_i y(s, t, θ; i) − min_i y(s, t, θ; i).
To diagnose the system based on the observed ripple, we need to understand the
distribution of obs_ripple(s, t, θ) for given s, t, and θ. As discussed previously, this
baseline may be determined empirically by observing the system when the network
is operating according to specification. However, to obtain the most precise decision
rule, the baseline distribution should be derived given the specific network settings
at spatial location s, time t, and with configuration θ. Hence, we seek a residual
measure of the form (12.1), i.e., obs_ripple(s, t, θ) − pred_ripple(s, t, θ), where
pred_ripple(s, t, θ) represents the conditional expectation (i.e., the second term on
the right-hand side) in (12.1).
Based on the conditions in ψ(s, t, θ), we seek the baseline of obs_ripple(s, t, θ),
or, more precisely, the probability density function p(obs_ripple(s, t, θ) | ψ(s, t, θ)).
Assuming an accurate simulator is available, we may approximate the distribution
of obs_ripple(s, t, θ) as follows: for a given ψ(s, t, θ), and a set of possible values
for the remaining system parameters ζ that may affect the observation, calculate
obs_ripple(s, t, θ). Repeated simulations covering all variability in ζ under the same
ψ(s, t, θ) would give the baseline distribution p(obs_ripple(s, t, θ) | ψ(s, t, θ)). Then,
referring to p(obs_ripple(s, t, θ) | ψ(s, t, θ)), one can obtain the probability that an
observed obs_ripple(s, t, θ) can occur under normal operating conditions. An excep-
tionally small value of this probability would signify an anomaly.
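In code, this simulation-driven baseline amounts to drawing the nuisance parameters ζ, running the span simulator for each draw under the fixed conditions ψ(s, t, θ), and locating the observed ripple within the resulting empirical distribution. The sketch below is a minimal illustration; simulate_ripple is a hypothetical stand-in for a span simulator such as the one used later in this chapter, and the significance level is an arbitrary example value.

import numpy as np

def baseline_tail_probability(obs_ripple, theta, zeta_samples, simulate_ripple):
    # Approximate p(ripple >= obs_ripple | psi(s, t, theta)) by Monte Carlo.
    #   theta           -- fixed span configuration (fiber type, length, losses, ...)
    #   zeta_samples    -- draws of the remaining system parameters zeta
    #   simulate_ripple -- hypothetical simulator returning the ripple for (theta, zeta)
    baseline = np.array([simulate_ripple(theta, zeta) for zeta in zeta_samples])
    return float((baseline >= obs_ripple).mean())   # empirical tail probability

def is_anomalous(obs_ripple, theta, zeta_samples, simulate_ripple, alpha=0.05):
    # Flag an anomaly when the observed ripple is exceptionally unlikely
    # under the simulated baseline (alpha is an illustrative level).
    return baseline_tail_probability(obs_ripple, theta, zeta_samples, simulate_ripple) < alpha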
A few issues are critical in using simulation to obtain a baseline distribution suitable
for deriving decision rules regarding the network health:
1. Does the simulator "mimic" the network sufficiently closely to warrant matching
obs_ripple(s, t, θ) with pred_ripple(s, t, θ)? Can we validate the "machine" approxi-
mation in a step prior to matching?
2. Can we easily define and obtain ψ(s, t, θ)?
3. How do we model the null-distribution of obs_ripple(s, t, θ)?
As it turns out, signal powers in a concatenation of multiple Raman-amplified
spans depend only on the conditions occurring at the entrance of and locally in the
span. Once the input spectrum to the current span is fully specified, variations in
upstream conditions are irrelevant. Thus, in our context, we are dealing with a one-
step Markov process. The variations that remain to be studied then include variations
in the input spectrum, plus the differences in the configuration parameters θ.
In the next section, we describe a designed experiment where we generate (simu-
late) the null-distribution of the “ripple” metric for one span by varying several key
configuration parameters that are most relevant, assuming a fixed input spectrum.
(A complete description of the distribution can be obtained by repeating the exper-
iment over another set of parameters ζ, here representing the upstream conditions
that may cause changes in the input spectrum.)
A set of n = 720 ripple measurements are observed from simulation across these
four variables. The connector losses and channel loadings are treated as categorical
variables with two factor levels each [12]. The connector loss levels are "typical"
and "worst case", and the levels of channel loading are 17 and 128 channels. The co-
pump setting is varied across 20 levels, including zero mW and then increments of
10 mW from 40 to 220 mW. The fiber length is varied across 9 levels from 40 to 120 km. The
experimental design produces a mesh with 2 × 2 × 20 × 9 = 720 measurement nodes.
In the analysis, both co-pump setting and fiber length are treated as quantitative
variables, producing a 2 × 2 ANOVA design with two quantitative regressors at each
of the four experimental levels.
Let α_i (i = 1, 2) and β_j (j = 1, 2) denote the (fixed) main effects of loss and
loading at levels i and j, respectively, and let γ and δ be the regression coefficients
for co-pump setting and fiber length, respectively. With p and ℓ representing pump
setting and fiber length, and with Y_ijk^ripple denoting the kth ripple measurement at
levels i and j, we write the main-effects-only model as

Y_ijk^ripple = μ + α_i + β_j + γp + δℓ + ε_ijk,   (12.4)

where μ is an overall mean effect and ε_ijk corresponds to unmodeled effects and
represents the discrepancy between the regression function and the data.
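Assuming the 720 simulated ripple values are collected in a table with columns for connector loss, channel loading, co-pump setting, and fiber length (the column and file names below are hypothetical), the main-effects model (12.4) can be fit with an off-the-shelf least-squares routine, for example:

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical table of the 720 simulated measurements with columns
# 'ripple', 'loss' ("typical"/"worst case"), 'nchan' (17/128), 'pump' (mW), 'length' (km)
df = pd.read_csv("ripple_experiment.csv")

# categorical main effects for loss and loading, linear terms for pump and length
main_effects = smf.ols("ripple ~ C(loss) + C(nchan) + pump + length", data=df).fit()
print(main_effects.summary())

The full model (12.5)–(12.8) and the log-transformed reduced model of Table 12.4 can be fit the same way by extending the formula with squared terms (e.g., I(length**2)) and the relevant interaction terms.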
A regression model including higher-order interaction terms is also used to rep-
resent the data. With notation given in Tables 12.1 and 12.2, the full three-way
interaction model is given by

Y_ijk^ripple = μ + α_i + β_j + γp + δℓ   (12.5)
    + (αβ)_ij + (αγ)_i p + (αδ)_i ℓ + (βγ)_j p + (βδ)_j ℓ + (γδ)pℓ + γ̃p² + δ̃ℓ²   (12.6)
    + (αβγ)_ij p + (αβδ)_ij ℓ + (αγδ)_i pℓ + (βγδ)_j pℓ   (12.7)
    + (αγ̃)_i p² + (βγ̃)_j p² + (γ̃δ)p²ℓ + (αδ̃)_i ℓ² + (βδ̃)_j ℓ² + (γδ̃)pℓ² + γ*p³ + δ*ℓ³   (12.8)
    + ε_ijk.
In the above model, (12.5) represents the main effects, (12.6) represents the two-
way interactions, and (12.7) and (12.8) model all three-way interactions.
Based on the experimental design and the above model, we next analyze data
from simulation and provide graphical displays of the experimental response surface
as well as the associated ANOVA results.
Using the FROG/RATS simulator [5], we model several SSMF (standard single-
mode fiber) spans of the lengths specified above. Each span also includes the con-
nector losses and is loaded with the number of channels as specified above. The
signal power is amplified by both forward and backward Raman pumping, with the
forward (co-)pumps fixed at the above-specified levels. An additional stage of back-
ward pumping is applied at a dispersion compensating module (DCM) that follows
the transmission fiber. The simulator includes a control module that determines the
optimal setting of the backward Raman pumps to provide a flat gain within a small
tolerance around the power target [4, 5]. The ripple is measured at the endpoint of
the dispersion compensating module, and is determined after all the pump settings
are optimized for each configuration and channel load.
In Figure 12.3 the measured ripple is shown as a histogram and in Figure 12.4
as box-plots. The measured ripple distribution has a mean and standard deviation of
1.35 and 1.61, respectively, and a median value of 0.59. The first and third quartiles
are 0.46 and 1.33. The histogram shown in Figure 12.3 depicts a right-skewed dis-
tribution for the ripple measurements. It should be noted that the depicted density is
a marginal distribution estimate, where the data is collapsed across all experimental
conditions. In principle, the marginal distribution can be used as a baseline distribu-
tion for anomaly detection. If it were used in this way, the cutoff point would be around
6.0 dB, which corresponds to the 95th percentile of the empirical, marginal distribu-
tion. Thus, we would classify an observed ripple as anomalous whenever the ripple
exceeds 6.0 dB. This value does, however, represent a rather large ripple, and we
seek a more precise value as a function of the parameters in . Indeed, as indicated
Fig. 12.3 Histogram of the measured ripple (dB), calculated based on a flat signal target
Table 12.3 Minimum, maximum, and sample quantiles of ripple split by channel loading and loss
Cell Min. 5% First quartile Median Mean Third quartile 95% Max.
17:typical 0.38 0.40 0.43 0.47 0.91 0.99 2.91 3.40
17:worst case 0.38 0.40 0.43 0.50 0.93 1.10 3.03 4.40
128:typical 0.42 0.43 0.53 0.68 1.60 1.50 6.15 6.40
128:worst case 0.43 0.44 0.55 0.72 1.90 2.20 6.81 7.70
Fig. 12.5 Mean level plot of ripple for connection loss and channel loading
Fig. 12.6 Interaction plot for connection loss crossed with channel loading
Fig. 12.7 Contour plot of ripple as a function of co-pump setting, fiber length, channel loading,
and connector losses
larger ripples. Figure 12.6 depicts the mean levels for the four experimental condi-
tions, and the near parallel lines indicate a lack of substantial interaction between
connection loss and channel loading. Figure 12.7 further delineates the data using
a contour plot of the (simulated) ripple surface across all combinations of the in-
dependent variables. As can be seen, the main effect due to channel load is clearly
visible as a monotone increase in ripple from 17 to 128 channels. Moreover, in each
of the four subplots, the surface appears bowl-shaped, indicating that higher-order
interactions involving fiber length are present.
Although Figures 12.3–12.6 give a good indication of which factors are most
crucial in determining ripple level, we now delineate more formally using ANOVA
which factors and combinations (interactions) of independent variables contribute
to the variability in ripple measurements.
To delineate which factors contribute most significantly to ripple using ANOVA,
we apply a log transformation to the data. This transformation is necessary because
Table 12.4 ANOVA table for the reduced model. The overall F-statistic for this model is 1,157
on 7 numerator and 712 denominator degrees of freedom (p-value < 2.2e-16). The residual
standard error for the model is 0.2405 and the adjusted R-squared value is 0.918
Source, parameter                          Df    Sum Sq    Mean Sq    F value     Pr(>F)
Channel loading, β_j                        1    32.191     32.191     556.457    <2.2e-16
Fiber length, δ                             1   206.179    206.179    3564.061    <2.2e-16
Fiber length², δ̃                            1   146.512    146.512    2532.649    <2.2e-16
Connector loss*fiber length, (αδ)_i         1     1.936      1.936      33.467    1.086e-08
Channel loading*fiber length, (βδ)_j        1    72.040     72.040    1245.297    <2.2e-16
Co-pump setting*fiber length, (γδ)          1     1.607      1.607      27.776    1.808e-07
Channel loading*fiber length², (βδ̃)_j       1     8.009      8.009     138.439    <2.2e-16
Residuals                                 712    41.189      0.058
the data distribution is right-skewed (cf. Figure 12.4), and the transformation serves
to approximately stabilize the variance across experimental conditions. Thus, Y_ijk^ripple
represents the (i, j, k)th ripple measurement in log scale, and the ANOVA study is
performed on the transformed data. For the full model described by (12.5)–(12.8),
which includes all main and interaction effects, the F-statistic equals 410.6 with
20 numerator and 699 denominator degrees of freedom (p-value < 2.2e-16). Al-
though the model is highly significant, not all terms in the full model are statistically
significant. Table 12.4 shows a reduced model where only highly significant factors
are retained. The reduced model fits the data well, and has an adjusted R² of 0.918.
As the ANOVA sums-of-squares represent independent sources of variation
(since the experimental design is orthogonal), the relative importance of the re-
tained factors can be obtained by considering the magnitude of the F -values in
column 5. Inspection of the table shows that two variables, namely, fiber length and
channel load, contribute most significantly to the variability in ripple. The estimated
response surface (in log scale) is given in the upper panel of Figure 12.8.
The fitted log-surface gives a visualization of how the expected ripple varies
smoothly across θ. From the fitted surface, along with an estimate of the average
magnitude of the residual term, σ̂ = 0.2405 (cf. caption of Table 12.4), a simple
anomaly detection rule for excess ripple is based on determining if log(obs_ripple) is
greater than the estimated log-surface plus 3σ̂. Equivalently, we may consider this
cutoff point in the data scale. This quantity is depicted in the data scale in the lower
panel of Figure 12.8. Thus, any ripple exceeding this surface is deemed anomalous.
We note that this classification rule is approximate, and is given as an example to
illustrate our first anomaly detection method. More generally, we comment that the
empirical distribution of residuals within each experimental condition can be used
to define more exact cutoff points. When the response surface turns out to be less
regular, one may consider using non-parametric models.
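Stated in code, the rule compares the logarithm of an observed ripple against the fitted log-surface plus three residual standard errors. The sketch below assumes a statsmodels regression object fitted to log-ripple with the hypothetical column names used in the earlier sketch; it is an illustration of the 3σ̂ rule, not a production detector.

import numpy as np
import pandas as pd

def excess_ripple_alarm(fitted_log_model, obs_ripple, loss, nchan, pump, length, n_sigma=3.0):
    # Flag an anomaly when log(obs_ripple) exceeds the fitted log-surface
    # plus n_sigma residual standard errors (the 3-sigma rule in the text).
    point = pd.DataFrame({"loss": [loss], "nchan": [nchan],
                          "pump": [pump], "length": [length]})
    predicted_log = float(fitted_log_model.predict(point).iloc[0])
    sigma_hat = np.sqrt(fitted_log_model.scale)      # residual standard error of the fit
    return np.log(obs_ripple) > predicted_log + n_sigma * sigma_hat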
Fig. 12.8 Fitted response surface and anomaly surface from the model described in Table 12.4
The example illustrates that using a simulator we can compute the expected ripple
measures as a function of a variety of system parameters, and that statistical mod-
els of the response surfaces can give simplified rules to guide anomaly detection
in real time. These include rules on the expected values together with measures of
In this section, we present the use of a detailed model that describes the interaction
of pump and signal powers in a Raman amplifier as depicted in Figures 12.1 and
12.2. We first give the mathematical description of the physical effects. This is fol-
lowed by the device model of a Raman repeater node that contains additional loss
elements. The losses are an important concern in calibrating the model predictions
(Section 12.4.1). We then describe the use of calibrated predictions in detecting poor
fit to observed gain profile (Section 12.4.2), premature signal degradation, and pump
failure (Section 12.4.3).
The following equation [3] represents a more comprehensive model for power
propagation on a Raman-amplified fiber span of length L that includes signal–signal
pumping, noise generation and Rayleigh back-scattering effects:
∂ρ⁺(z, ν)/∂z = −α(ν) ρ⁺(z, ν) + γ(ν) ρ⁻(z, ν) − 4hν ρ⁺(z, ν) ∫ b_T(ν, ζ) dζ   (12.9)
               + 2hν ∫ b_T(ν, ζ) [ρ⁺(z, ζ) + ρ⁻(z, ζ)] dζ
               + ρ⁺(z, ν) ∫ a(ν, ζ) [ρ⁺(z, ζ) + ρ⁻(z, ζ)] dζ,

where

a(ν, ζ)   :=  g(ν, ζ)                   if ζ ≥ ν,
              −g(ζ, ν)                  if ζ < ν,

b_T(ν, ζ) :=  (1 + n_T(ν, ζ)) g(ν, ζ)   if ζ ≥ ν,
              n_T(ν, ζ) g(ζ, ν)         if ζ < ν,

n_T(ν, ζ) :=  1 / (exp(h |ν − ζ| / (k_B T)) − 1).
Here, ρ^(±)(z, ν) represents the power spectral density at fiber position z and
frequency ν propagating in the (±) forward/backward direction, α(ν) is the fiber
attenuation coefficient, γ(ν) is the Rayleigh back-scattering coefficient, g(ν, ζ) is
the Raman gain for a pump at frequency ζ and a signal at ν, h is Planck's constant,
k_B is Boltzmann's constant, and T is the absolute temperature. Equation 12.9 de-
scribes the power propagation on a Raman fiber span. Note that the power p(z, ν_i)
at position z and in a signal band located at frequency ν_i and with channel width
dν_i is given by

p(z, ν_i) = ∫_{ν_i − dν_i/2}^{ν_i + dν_i/2} ρ(z, ν) dν.   (12.10)
For further descriptions of the relevant physical effects, readers are referred to [1,
11, 14].
Equation 12.9 is completed by specifying launch conditions for the forward-
propagating signals and co-pumps and the backward-propagating counter-pumps,
i.e., ρ⁺(0, ν) and ρ⁻(L, ν). The equations are solved by using a Galerkin-type dis-
cretization in the frequency space and by use of higher-order predictor–corrector
methods in physical space [5].
The propagation model for a Raman repeater node is given in Figure 12.9, and
includes the S-Office and R-Office connector losses Γ_S, Γ_R. The office losses
Fig. 12.9 Diagram of elements comprising a Raman repeater node. The direction of signal
propagation is from left to right in the diagram
are constant values given in dB that degrade the power spectra going through the
connector, i.e., ρ_out^(±)(ν) = ρ_in^(±)(ν) · 10^(−Γ/10).
The detailed physical model is used to produce anomaly alarms in two ways: (i)
measured signal powers at z = 0 and pump powers at both z = 0 and z = L are
used to predict signal powers at z = L, and the predicted values are then compared
to measured signal powers at z = L; and (ii) a maximum likelihood procedure is
used to estimate the connector losses (Γ_S, Γ_R), which are compared to nominal or
provisioned values. Discrepancies observed in either (i) or (ii) indicate anomalies.
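A simple way to realize check (ii) is a grid search: sweep candidate (Γ_S, Γ_R) pairs, simulate the span output for each pair, and keep the pair that minimizes the RMSE against the measured spectrum; a large best-fit RMSE, or best-fit losses far from the provisioned values, then raises an alarm. In the sketch below, simulate_output_powers is a hypothetical stand-in for the detailed physical model, and the grid and the 0.3 dB fit-quality figure follow the discussion of Figures 12.10 and 12.11.

import numpy as np

def calibrate_connector_losses(measured_dbm, simulate_output_powers,
                               loss_grid_db=np.arange(0.0, 2.05, 0.1)):
    # Grid search over (Gamma_S, Gamma_R) in dB; return the best-fit pair and
    # its RMSE against the measured output channel powers (dBm).
    best = (None, None, np.inf)
    for gs in loss_grid_db:
        for gr in loss_grid_db:
            predicted = simulate_output_powers(gs, gr)     # hypothetical model call
            rmse = np.sqrt(np.mean((predicted - measured_dbm) ** 2))
            if rmse < best[2]:
                best = (gs, gr, rmse)
    return best

# an alarm could compare best[2] against a fit-quality threshold (e.g., 0.3 dB)
# and (best[0], best[1]) against the provisioned connector losses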
The key physical parameters controlling the Raman span behavior are as follows:
the attenuation coefficient α(ν), the Raman gain coefficient g(ν, ζ), the fiber length
L, and the connector losses Γ_S, Γ_R. Values for L, α(ν) and the total connector loss
(Γ_S + Γ_R) can be estimated from OTDR (Optical Time Domain Reflectometry) and
total span loss measurements, while nominal values for the Raman gain coefficient
g(ν, ζ) are typically used.
To illustrate the important role that connector loss estimation plays in fitting field
data, the root-mean-squared error (RMSE) in the measured versus simulated output
signal powers is plotted in Figure 12.10. The illustrated case has 15 channels prop-
agating on a 103.93 km span of LEAF fiber with 1 forward pump and 5 backward
pumps. Note that there is a continuous range of possible values for the connector
loss, represented by the dark contours near the diagonal, where the simulated and
field data agree to within 0.3 dB. The approximate relation between the S-Office
and R-Office connector losses where the best fit (i.e., smallest RMSE) is obtained is
In Figure 12.11, the field data spectra are compared against two simulated spectra
given by (12.11) with two distinctly different connector loss values. Note that both
Fig. 12.11 Comparison of simulated versus field data for 15 channels propagating on 103.93 km
of LEAF fiber with 1 forward pump and 5 backward pumps: (a) Γ_S = 1 dB, Γ_R = 1.1 dB; and (b)
Γ_S = 0 dB, Γ_R = 1.6 dB. RMS error between simulated and field data is 0.3 dB for (a) and (b)
simulated spectra yield nearly identical shapes, suggesting that office loss values
given by (12.11) are adequate for estimating expected output signal powers. This
result shows that there is a continuum of connector loss values yielding good pre-
dictions of the signal gain for this 15-channel configuration.
The detailed physical model can also be used to diagnose anomalies on a Raman
span in an optical network. This is illustrated in Figure 12.12, where the predictions
of the detailed physical model are compared with field data for the results shown
in Figure 12.16 (16 channels propagating on 95.17 km of LEAF fiber on one span
of a DWDM link). Note that even though the simulation results shown in Figure
12.12 correspond to a calibrated, best fit to the field data, there is still a significant
discrepancy between the simulated and field data – the field data exhibit a different
spectral shape in the low frequency range along with significantly larger ripple than
the simulations. Furthermore, the RMSE between the simulations and field data is
0.9 dB, significantly larger than the 0.3 dB error observed for the results shown in
Figure 12.11. These discrepancies indicate that anomalous behavior is occurring on
this span.
The detailed physical model can also be used to diagnose anomalous behavior not
readily captured by the engineering-rule based approaches or the R-beta model. In
this section we explore how the detailed physical model can be used to detect the
premature degradation and/or failure of a Raman pump.
In Figure 12.13a the predicted output signal spectra are plotted as a function of
frequency for the case of 56 channels propagating on 94.04 km of TWRS fiber. The
black curve with circles corresponds to the expected output power spectra where all
pumps are behaving normally. The other curves correspond to different calibration
cases describing the Raman span, and illustrate what the expected output would
be if pump 1 was not experiencing the 4 dB degradation of its output power, i.e.,
pump 1 was actually being launched with +4 dB extra power to compensate for
the pump degradation. Note that all of the calibration cases (A–D) show a large
expected increase in the total output signal power, along with an increase in the
signal tilt across the frequency band.
Fig. 12.13 Example of a virtual experiment illustrating a 4 dB counter-pump failure for 56 chan-
nels propagating on 94.04 km of TWRS fiber. Black curve with symbols corresponds to non-failure
(normal) case, while curves A, B, C, and D correspond to different calibration cases describing the
physical span and with the launched pump power for the failed pump being +4 dB too large: (a) if
the failure is at pump 1; and (b) if the failure is at pump 2. Note that the predicted channel powers
are approximately 5 dB too large for cases A, B, C, and D, indicating anomalous behavior on the
span
The similar spectral shape for the different
calibration cases (A–D) suggests that it is not essential to have an absolute fit to the
fiber properties to detect anomalous behavior such as a pump failure.
In Figure 12.13b results are plotted illustrating the effect of a 4 dB failure at pump
2 for the same fiber and launch conditions shown in Figure 12.13a. The additional
+4 dB in power increases the signal powers, although the tilt (slope) of the signals
with respect to frequency has changed sign relative to Figure 12.13a.
These examples show how the detailed physical model can be used to give a
precise characterization of the effects of various failures. However, the challenges
in calibration leave some uncertainty in identifying the exact cause of the failures.
Also, a proper threshold is needed on the similarity measure between the observed
and predicted gain profiles in order to flag an anomaly. A good threshold value that is
universally useful is often difficult to obtain. In the next section, we describe another
method based on using a mathematical model to predict the gain shape and match it
with observation. By explicitly introducing a noise model, the method reduces the
task of anomaly detection to a standard statistical procedure for outlier detection,
where an alarm threshold can be selected in a principled way.
In this section, we describe an effective way of modeling the contributions of the
physical process and of measurement noise to an observed signal. Statistical estimation
using the model allows an observed signal to be properly decomposed into the con-
tribution of each. A standard goodness-of-fit test then raises an alarm when the
observed signal cannot be explained by the underlying physical process and the
expected noise. This may signify a more serious type of anomaly, for example that
the measurement has been made at the wrong spot due to equipment installation
errors.
The following equation approximates the evolution of the channel power p(z, λ_i)
at channel wavelength λ_i as a function of distance z along the fiber in a Raman-
amplified fiber span:

dp(z, λ_i)/dz ≈ −α(λ_i) p(z, λ_i) + Σ_{j=1..N_pumps} R(λ_i, λ_j) p(z, λ_i) p(z, λ_j).   (12.12)

In (12.12), p(z, λ_j) denotes the pump power at pump wavelength λ_j, α(λ_i) is the
fiber attenuation coefficient at wavelength λ_i, R(λ_i, λ_j) is the Raman gain coef-
ficient between the pump at wavelength λ_j and the signal at wavelength λ_i, and
N_pumps is the number of pumps. The relationship in (12.12) can be alternatively ex-
pressed by dividing by p(z, λ_i) and integrating both sides from z = 0 to L, the fiber
length.
Using β = {β_j}, β_j = ∫₀^L p(z, λ_j) dz, for the pump power integrated over the
length of the fiber span, R for a k × N_pumps matrix with elements R_ij ≡ R(λ_i, λ_j),
and Y = {Y_i} for the attenuation-adjusted gain for channel i, the above relationship
gives the following model [6]:

Y = Rβ + ε.   (12.13)

Here, the quantity ε measures the gain across the Raman-amplified span that
is not captured by the R-Beta model in (12.13). The magnitude of this term, ‖ε‖,
is a measure of how well the actual measured data is described by the R-Beta
amplification model, independent of particular pump levels. The residual ε includes
measurement noise, physical effects left out by the R-Beta approximation, and the
effects of any potential anomalies (see Figure 12.14a).
Computational experiments comparing the R-Beta model with simulations from
the detailed physical model (described in Section 12.4) show that (12.13) captures
the essential physics of Raman-amplified spans extremely well. Measurement noises
are expected to be small, and, thus, large values for the error norm ‖ε‖ are indicative
of anomalous behavior present on the Raman-amplified span. In Figure 12.15 the
normalized signal gain and the estimated unmodeled gain, defined by ‖ε̂‖/‖Rβ̂‖,
are plotted for 14 data polls on a fiber span. For these 14 polls, typical values for
‖ε̂‖/‖Rβ̂‖ are around 3e-4.
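Fitting (12.13) and computing this relative error norm reduces to a single least-squares solve per data poll, as in the following sketch (the alarm threshold is illustrative; R is the k × N_pumps Raman gain matrix and Y the vector of attenuation-adjusted gains):

import numpy as np

def r_beta_error_norm(Y, R):
    # Least-squares fit of Y = R*beta and the relative residual norm
    # ||Y - R*beta_hat|| / ||R*beta_hat|| used as the anomaly measure.
    beta_hat, *_ = np.linalg.lstsq(R, Y, rcond=None)
    fitted = R @ beta_hat
    return np.linalg.norm(Y - fitted) / np.linalg.norm(fitted)

def span_alarm(Y, R, threshold=1e-3):
    # Flag the span when the unmodeled gain exceeds a threshold set, for
    # example, well above the values observed on healthy spans.
    return r_beta_error_norm(Y, R) > threshold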
To illustrate how the R-Beta model can be used to detect gain that is not con-
sistent with Raman amplification, consider the data depicted in Figure 12.16, where
the normalized signal gain and the error norm are plotted for 14 data polls for span 7
of the same link. In contrast to the data for span 2 (see Figure 12.15), note the more
pronounced ripple, or undulations, present in the normalized signal gain. The error
norm shown in Figure 12.16b is an order of magnitude larger than the corresponding
values for span 2, and is indicative of anomalous behavior. The large ripple that is
observed on this span is likely not due to improper settings of the Raman pumps – if
the control algorithm regulating the Raman pumps were to have chosen suboptimal
values, i.e., pump values resulting in large ripple, the normalized signal gain would
still be expected to lie in the column space of the Raman gain matrix R, and hence
the error norm should be of similar magnitude to that shown in Figure 12.15b. This
illustrates the usefulness of the R-Beta model as a tool for joint gain-pump analysis
to determine anomalous span behavior.
Fig. 12.14 (a) Observed Y (Equation 12.13, dotted lines) and predictions from regression model
(solid lines) for two spans on a DWDM link. A large residual vector ‖ε‖ in the second span (the
lower lines) is indicative of a gain profile that is inconsistent with Raman amplification, and thus
triggers an alarm. (b) Observed signal powers (filled circles) and calibrated simulations (hollow
triangles). The discrepancies in spectral shape and ripple magnitude are indicative of anomalous
behavior
Fig. 12.15 Field data consisting of 14 data polls on span 2 of a 12-span DWDM link: (a) normal-
ized signal gain across Raman span; and (b) error norm ‖ε̂‖/‖Rβ̂‖ in a fit of the R-beta model to
field data
Fig. 12.16 Field data consisting of 14 data polls on span 7 of a 12-span DWDM link: (a) Nor-
malized signal gain across Raman span; and (b) error norm in a fit of the R-beta model given by
(12.13) to field data. Note that the scale of the x-axis in Figure 12.16b is a factor of 10 larger than
that of Figure 12.15b
12.6 Conclusions
Acknowledgements We thank Wonsuck Lee, Lawrence Cowsar, and Roland Freund for early
discussions on possible approaches. In identifying useful diagnostic data we were advised by many
members of Alcatel-Lucent’s Optical Networking Group. Special thanks go to Steve Eichblatt,
Sydney Taegar, and Lee Vallone who conceived and developed the transmission diagnostic data
retrieval tool. Narasimhan Raghavan, Bill Thompson, Jeff Sutton, and Tom Kissell provided much
needed insight into the system engineering, operation, and testing issues. Their help is gratefully
acknowledged.
References
D. Raz ()
Computer Science Department, Technion – Israel Institute of Technology,
Haifa 32000, Israel
e-mail: [email protected]
R. Stadler
School of Electrical Engineering, KTH Royal Institute of Technology,
SE-100 44 Stockholm, Sweden
e-mail: [email protected]
C. Elster
Qualcomm Israel, Omega Building Matam Postal Agency,
31905, Israel
e-mail: [email protected]
M. Dam
School of Computer Science and Communication, KTH Royal Institute of Technology,
SE-100 44 Stockholm, Sweden
e-mail: [email protected]
techniques, and then describe two specific cases in which this paradigm gener-
ates provably efficient solutions. The first one is in the area of traffic engineering,
where there is a need to monitor the aggregated delay of packets along a given
network path. The second case deals with the problem of monitoring general ag-
gregated values over the network, with emphasis on computing the values in a
distributed way inside the monitoring layer. All together, we believe that this new
paradigm presents a promising direction to address the challenges of cost-effective
management of future networked systems.
13.1 Introduction
Monitoring, i.e., the process of acquiring state information from a network or a net-
worked system, is fundamental to system operation. In traditional network and sys-
tems management, monitoring is performed on a per-device basis, whereby a cen-
tralized management entity polls the devices in its domain for information, which is
then analyzed and acted upon. Traditional management frameworks and protocols,
including SNMP, TMN, and OSI-SM, support this monitoring paradigm [24].
In this approach, the network is oblivious in the sense that the centralized man-
agement entity initiates all messages and analyzes all results. Over the past 20 years,
this paradigm has proven to be fairly successful for networks of moderate size,
whose configuration rarely changes, and whose states evolve slowly and thus do not
require intervention within seconds by an outside system. However, the paradigm
has two significant weaknesses, which makes it less attractive for today’s (and even
more so for future) networked systems. To start with, the centralized management
approach does not allow for fast reaction to changes, simply because collecting the
information, transporting it over the network to the centralized entity, analyzing it,
and communicating the control decision back to the managed devices takes time.
In addition, in order to check for certain anomalies, the entire system must be
monitored continuously, resulting in a high monitoring overhead.
In this chapter, we describe several monitoring algorithms (which we also refer to
as monitoring protocols) that utilize a new monitoring paradigm called In-Network
Monitoring. This paradigm is designed to address the above shortcomings, and we
demonstrate how it can be applied to managing highly dynamic networked systems.
The main idea of in-network monitoring is to introduce a small management entity
inside each network device, which, in addition to monitoring local parameters, can
also perform limited management functions and communicate with peering entities
in its proximity. The collection of these entities creates a monitoring layer inside
the network, which can perform monitoring and control tasks without involving
the centralized entity. In-network monitoring is investigated within the EU 4WARD
project [1].
Note that the monitoring layer does not replace the centralized management en-
tity, but rather complements it. It is important to understand that computing a global
state may require information from all devices, and generating such a global view
using the distributed monitoring layer may, in some cases, be more complex and
computationally expensive than a centralized solution. Moreover, the centralized
management entity is an important reference point for many components in the sys-
tem and thus must remain part of any management paradigm.
The need for distributing monitoring and management tasks has been recognized
before and has been studied in the research community since the mid-1990s. Con-
cepts like management by delegation, mobile agents, and distributed objects have
been developed with the goal of making network management systems more ef-
ficient, better scalable, and less complex (cf. [21]). Within the same time frame,
new engineering concepts in networking and telecommunications, namely active
networking and programmable networks, have appeared, aimed at simplifying the
introduction of new functionality into a network environment ([19, 27, 35]). In-
network monitoring leverages aspects of these efforts as enabling technologies to
create the monitoring layer.
In this chapter, we concentrate on the monitoring of parameters in a network set-
ting. We demonstrate how in-network monitoring can help build better and more
efficient systems. We start with a general description of network monitoring tech-
niques, and then describe two specific cases in which this paradigm generates
provably efficient solutions. The first one is in the area of traffic engineering, where
there is a need to monitor the aggregated delay of packets along a given network
path. The second case deals with the problem of monitoring general aggregated val-
ues over the network, with emphasis on computing the values in a distributed way
inside the monitoring layer. Altogether, we believe that this new paradigm presents
a promising direction to address the challenges of cost-effective management of fu-
ture networked systems.
We assume that we have a network environment where each network element e can
monitor a local variable f(e) (this could be, for example, the delay along link e for
a specific flow). We are interested in an aggregated function of these values (e.g.,
the sum) over a subset of network elements (e.g., the links along a given path in the
network), and want to know when this value exceeds a certain predefined threshold.
As described in Section 13.1, we assume a centralized management entity
(CE), together with local monitoring and restricted processing in each of the elements. The CE monitors the value of the function over different sets of elements
and determines when these values do not (or soon might not) satisfy the constraint
on the value of the global function f_global. Thus the CE only needs to know about
sets whose value is too large with respect to their bound.
We distinguish between several basic monitoring techniques that are in practical
use today. In the first technique, called polling (see, e.g., [28]), the CE polls the
Network Elements (NEs) regarding the value associated with each of the relevant
sets. Each NE sends this information as requested to the CE, and the CE computes
the overall value of each set. This method is formulated in pseudo-code in Figure 13.1.

Fig. 13.1 The polling technique
1. for each NE e
2.   if request for report arrived from CE
3.     send report (e, f_e) to CE

Fig. 13.2 The pushing technique
1. for each NE e
2.   if time to send or local event
3.     send report (e, f_e) to CE
A different way to obtain the information is to let the monitoring initiative come
from the devices rather than from the CE. In this technique, called the push model or
pushing, the local device decides based on the available local information to send a
report to the CE. This push event can be done periodically regardless of the value of
the local variables (oblivious pushing) or it can be reactive, triggered by a change in
the value of one or more local variables. This is illustrated in Figure 13.2.
The number of control packets sent to the CE is bounded from above by the number
of nodes N. Taking the average distance to the CE to be B, we get that the network
load generated by a single polling event is O(B·N). The computation load of each
node is O(1) and the computation load at the CE is O(N). If we have a single push
event then the generated network load is O(B), but if we have such an event in each
node the cost is similar to that of polling.
As mentioned before, if monitoring (either push or pull) is done periodically,
regardless of the network state, we call it oblivious monitoring. Such oblivious
monitoring is good to collect full statistics regarding the values of interest. These
can then be used off-line to analyze the network performance. However, in many
cases, when we need to react to situations where a certain value exceeds its assigned
threshold, we can make do with much less than the full information delivered by oblivious
monitoring. Thus, we can use event-driven monitoring, in which the monitoring is
triggered by an event in one of the nodes. In this kind of monitoring – termed re-
active monitoring – the management station (or the CE) reacts to some anomalous
behavior in the network and only then checks whether the global state of the system
is illegal (i.e., its aggregated value exceeds the bound).
Reactive monitoring of network elements in general was initially proposed by
Dilman and Raz [11]. The basic idea is to partition the global bound associated
with a set into smaller bounds associated with each element (called local thresholds).
Once the local threshold in one of the NEs is exceeded, that NE triggers an alarm
(i.e., sends a control message) to the CE. The CE responds, according to the de-
ployed monitoring technique, by probing the set, or by polling the NEs in this set.
Thus, this general approach yields two reactive monitoring methods. We use these
reactive methods, as well as the oblivious ones, as yardsticks for the performance
of the monitoring algorithms described in Section 13.3 of this chapter that use the
in-network management paradigm.
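As a minimal illustration of this reactive scheme, the following sketch (our own; the equal partitioning of the global bound and all names are assumptions, not taken from [11]) shows the NE-side trigger and the CE-side check:

    # NE side: alarm only when the local share of the bound is exceeded.
    def local_threshold(global_bound, set_size):
        # simplest possible partition: split the global bound equally among the NEs
        return global_bound / set_size

    def ne_check(local_value, threshold, send_alarm):
        if local_value > threshold:
            send_alarm()          # one control message to the CE

    # CE side: react to an alarm by polling the whole set and checking the bound.
    def ce_on_alarm(poll_set, global_bound):
        values = poll_set()       # one request/response per NE in the monitored set
        return sum(values) > global_bound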
In Sections 13.4 and 13.5, we present three protocols that rely on the push-
based monitoring technique. The protocols – a simple aggregation protocol and
two refinements of it – provide two basic capabilities: first, the capability of con-
tinuously estimating a global aggregation function, such as SUM, AVERAGE, or
MAX, which is computed over local variables associated with NEs; second, the
capability of detecting threshold crossings of such global aggregation functions.
These protocols create spanning trees that interconnect the NEs and use reactive
methods to determine when to send messages towards the root of a tree. They rely
on processing capabilities within the NEs that compute the aggregation functions in
a decentralized way, using the spanning tree for communication between NEs. The
main challenge is to perform such in-network aggregation in a cost-effective way,
i.e., with small protocol overhead.
With the development of modern Internet applications, such as real-time audio and
video, the original “best-effort” design of the Internet Protocol (IP) is no longer
sufficient. This has led in the past few years to the development of the Quality of
Service (QoS) concept, in which applications can request, and the network can pro-
vide the resources needed to guarantee the required service level [39]. This allows
Internet Service Providers (ISPs) to offer predictable service levels in terms of data
throughput capacity (bandwidth), latency variations (jitter), and propagation latency.
In order to supply a certain level of service and to conform to QoS needs,
ISPs need to support QoS-aware mechanisms that include network resource
management capabilities. In this part of the chapter we investigate the monitoring of
flows’ QoS parameters in a network. This task is essential for
the successful provisioning of network services.
Differentiated Services (DiffServ) [5] is nowadays considered the preferred way of
implementing QoS in the Internet. In DiffServ, every packet is classified into one
of a small number of possible service classes. DiffServ allows the routers to give
different service to different packets according to their QoS class. For example, a
router can drop packets of a lower class in order to guarantee that the bandwidth
requirements of a higher class are satisfied. Nevertheless, resource allocation remains
a big issue; wise planning can dramatically decrease congestion and increase utilization. Moreover, since in DiffServ there is no actual reservation or allocation of
resources to specific flows, guaranteeing end-to-end performance requirements
is a very challenging task.
In recent years, the reduction in equipment prices has led to the use of the most trivial
planning method: over-provisioning. Over-provisioning means that suppliers equip
their networks with more resources than they ever expect to be consumed, and respond to
congestion by acquiring more equipment. However, over-provisioning fails to support
the new traffic-aware applications; an application may request a significantly
large amount of traffic resources during some particular period while network usage
is much lower at other times. Thus, allocating these resources in advance to support
such bursts is wasteful, which in turn makes the over-provisioning
method ineffective and costly.
An alternative to over-provisioning is dynamic provisioning. Such provisioning
is based on a managing authority – the Centralized Entity (sometimes called the Bandwidth Broker) – which dynamically
allocates resources in the network. Like any other controller, the CE must be sup-
ported by a monitoring facility [3]. This monitoring facility provides the CE with
data that will allow it to make its decisions and judge their results. Needless to say,
the more accurate and up to date those data are, the more successful is the resource
allocation. Note that this type of approach is not applicable for short timescales (say
of several milliseconds). This is due to the fact that the report notification time to the
CE is of this short timescale. Thus, the approach aims at events that last longer, and
cause the delays in the routers’ buffers to increase for timescales of several seconds
or minutes. Clearly, no centralized solution can be used for monitoring of very short
local congestions.
We focus on end-to-end delay monitoring, which is the critical QoS parame-
ter for popular Internet services such as Voice over IP [29]. To perform the delay
monitoring, the CE uses a share of the network resources which in turn reduces
the amount of resources available to the users. The challenge in QoS monitoring is
twofold. On the one hand, we want to provide the CE with the needed information
in order to detect problematic flows, and on the other hand, we want to use as little
communication as possible.
In this part of the chapter we describe another approach to the monitoring
problem – autonomous monitoring that uses the in-network monitoring paradigm.
With autonomous monitoring, the network performs all the processing required to
detect a congested flow and the CE is only informed about flows which are indeed
congested, and thus the network becomes self-monitored. Because congestion de-
tection is fully distributed, the effect of a local congestion, which is due to noise
and which does not expand to the whole flow, subsides after just a few messages are
exchanged. We concentrate on a protocol, called AMoS – Autonomous Monitoring
of Streams – which monitors the end-to-end delay of packets in the flow (see [13]).
AMoS’s requirements from the network are modest. First, each router must be able
to monitor its own load, as suggested in [8], and trigger a monitoring message if
the local load is greater than a given threshold. Second, upon receiving a monitor-
ing message, a router adds its own load to the accumulated one and makes a simple
comparison before deciding if the message should be dropped, forwarded to the next
hop, or sent to the CE.
We thoroughly analyze the behavior of AMoS and present extensive simulation
results. Our analysis shows that AMoS produces much less traffic than state-of-the-art
protocols in both low and high network load conditions. Like reactive
monitoring approaches, it imposes no monitoring load at all unless there is local congestion.
Moreover, because local congestion is dealt with locally, AMoS is robust
in the sense that its performance remains almost the same over a wide range of load
variations and flow lengths. More specifically, even if a third of the links suffer from
a very high load that exceeds their local threshold, the performance of AMoS is still
better than that of any other monitoring algorithm.
In this section we describe the AMoS algorithm. Note that the routers along the path
of a flow form the set of nodes, and the function f is the delay along the link going
out of each router along the specific path. The algorithm’s starting point is similar to
reactive monitoring: each link along the flow’s path receives a local delay threshold
such that if none of the links exceeds its threshold, then the total delay of the flow is
within the desired values (as indicated by the SLA) and thus no further actions are
needed. However, unlike reactive monitoring, when a local threshold is exceeded
on a link, the node attached to this link tries to figure out whether this local event
is a global alarm condition (i.e., the total delay of the flow exceeds its threshold)
or is a false alarm. A false alarm is a situation in which the local threshold has
been exceeded but no global event has occurred. The main idea behind AMoS is
to allow flows to recover locally from false alarms in a distributed way without
involving the CE.
A typical example of a false alarm is a situation in which one of the links of the
flow suffers a high load, and thus the delay of a certain QoS class is higher than
the local threshold, but the next link on the path experiences a very low load, and
therefore the delay of the same class on that link is well below its local
threshold. If the total delay over these two links is smaller than the sum of the two
local thresholds, they cancel each other out, and there is no need to alert the CE.
Thus, our algorithm uses control messages that are sent along the flow path and
contain aggregated delay and aggregated threshold values. Once a message arrives
at a node, the node checks whether, after adding its local values (local delay and local
threshold), the requirements are still violated. If the aggregated delay exceeds the
global flow delay threshold, the node informs the CE immediately. Otherwise, it adds
the local delay and threshold to the aggregated values. If the updated delay exceeds
the updated threshold, it forwards the message to the next hop; else the message is dropped.
The formal definition of AMoS is presented in pseudo-code in Figure 13.3, which
uses the following notation. For each node, we use (e, s) to indicate that flow s
goes over incoming link e. Each node knows the next hop of flow path s, indicated
by next_hop_s. The delay of edge e is denoted by delay_e and the local threshold for
this edge is given by threshold_e. The total threshold of flow s is total_threshold_s.
Lines 1 to 9 describe the Initialization Phase of the algorithm. In this part every
node examines its local information for every incoming link by comparing the link
delay (delay_e) to the link threshold (threshold_e) for every flow path s. In the case
when the local delay exceeds its threshold, the node creates a control message
packet(s, delay_e, threshold_e) and addresses it to the next node in the flow path (or
to the second node in the path if it is the last node on the path).
Fig. 13.3 The AMoS algorithm (pseudo-code)
The Work Cycle Phase of the algorithm is described in lines 10–18. If a node
receives a message packet(s, delay, threshold), it adds its local information regarding
flow s (delay_e and threshold_e) to the received message and examines whether
the calculated delay violates the newly calculated threshold (i.e., delay + delay_e >
threshold + threshold_e). In such a case, the node updates the received message
with the new information and sends it to the next node along flow path s. At any
point at which a node detects that the flow’s global threshold has been exceeded, it informs the
CE and stops the propagation of control messages (see lines 4 and 13 in Figure 13.3).
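To make the two phases concrete, the following is a small Python sketch of the per-node AMoS logic (our own rendering; the class and method names are hypothetical and do not come from Figure 13.3):

    class AmosNode:
        def __init__(self, delays, thresholds, total_thresholds, next_hop):
            self.delays = delays                      # (e, s) -> local delay of link e for flow s
            self.thresholds = thresholds              # (e, s) -> local threshold
            self.total_thresholds = total_thresholds  # s -> global (SLA) delay threshold
            self.next_hop = next_hop                  # s -> next node on the flow path

        def initialization_phase(self):
            # Initialization Phase: trigger a control message when a local threshold is exceeded.
            for (e, s), delay in self.delays.items():
                if delay > self.thresholds[(e, s)]:
                    self.send(self.next_hop[s], (s, delay, self.thresholds[(e, s)]))

        def on_message(self, s, delay, threshold, e):
            # Work Cycle Phase: add local values, then decide: inform CE, forward, or drop.
            new_delay = delay + self.delays[(e, s)]
            new_threshold = threshold + self.thresholds[(e, s)]
            if new_delay > self.total_thresholds[s]:
                self.inform_ce(s)                     # global threshold exceeded
            elif new_delay > new_threshold:
                self.send(self.next_hop[s], (s, new_delay, new_threshold))
            # otherwise the false alarm has been cancelled locally and the message is dropped

        # Transport and CE-notification primitives are left abstract in this sketch.
        def send(self, node, packet): ...
        def inform_ce(self, s): ...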
It is clear from the algorithm that messages are sent to the CE only if the global
delay exceeds the global threshold. The more complicated part of proving AMoS’s
correctness is to show that if for some flow the total delay is greater than its threshold,
then a report is sent to the CE regarding this flow. Due to space limitations, we
do not show the entire proof, yet it can be established based on the observation
that no control message can stop its propagation on the path of another control
message.
The more interesting question is the performance of AMoS, that is, the number
of control messages it uses. Note that if there is no local congestion AMoS sends
no messages at all. On the other hand, when the network is congested all over, the
delay in all nodes may exceed their local threshold and n messages (assuming n is
the flow length) will be sent, resulting in n separate notifications to the CE. In such
a situation the message complexity of AMoS could be worse than that of any of the basic
monitoring algorithms described above. It is therefore important to understand the
behavior of AMoS with respect to network load.
As explained before, we try to identify the values of p for which E_n < 1. According to Claim 13.3.1.1, if the expected value associated with a received message is 1, then only one
control message is received by this node. Thus, instead of looking at the values received
at a specific node k, let us examine how the value of a fixed message m evolves. When
message m moves over an intermediate link l on its path, the value of m can either
increase by the delay of this link, if it is positive, or decrease if it is negative. Thus,
given a value of m, the probability that the value of m will be increased by 1 is p, while the
probability that it will be decreased by 1 is (1 − p).

Fig. 13.4 The Markov chain used in the analysis: from state S_i the chain moves to S_{i+1} with probability p and back to S_i with probability 1 − p
We first assume for simplicity that the flow length is infinite. In this case we can
build an infinite Markov chain, where state S_i represents a message with value i,
for i ≥ 0. The probability of moving from state S_i to state S_{i+1} is p, and the
probability of moving from state S_{i+1} to state S_i is 1 − p (see Figure 13.4). For
1 − p > p (i.e., p < 0.5) the system is stable. The following equations describe the
relationship between the steady-state probabilities.
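Written out (our reconstruction, with the shorthand ρ = p/(1 − p), which is not used in the original text), the balance equations and the resulting expected value are

    p · π_i = (1 − p) · π_{i+1}   (i ≥ 0)   ⟹   π_i = (1 − ρ) ρ^i ,

    E = Σ_{i≥0} i · π_i = ρ / (1 − ρ) = p / (1 − 2p) ,   so that   E < 1  ⟺  p < 1/3.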
We conclude that if the probability of a link being loaded is less than 1/3, then we
expect less than one message to be received by a node on average. This result
is very important, since it allows us to evaluate the conditions under which AMoS
performs better than any other algorithm. However, in more realistic cases, network
parameters differ from the binary loaded/not-loaded infinite model described here; nevertheless,
this result still provides insight into the performance behavior of AMoS.
Next we turn to analyze the more realistic case where the flow has a finite
length n. This case can be treated in a similar way, yielding

    E_n = [ ρ − (n + 1) ρ^{n+1} + n ρ^{n+2} ] / [ (1 − ρ) (1 − ρ^{n+1}) ] ,   with ρ = p/(1 − p).   (13.4)
We can use this closed formula. The values for different p and n are plotted in
Figure 13.5(a). The figure shows that, as long as we are not too close to the non-stable
point, p = 0.5, the results match very well the values that were calculated
using a recursive formula we developed.
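As a quick numerical sanity check (our own sketch, not part of the original study), E_n can also be computed directly from the truncated steady-state distribution π_i ∝ ρ^i, i = 0, …, n, without the closed form:

    def expected_value(p, n):
        # E_n for the truncated chain: states 0..n with pi_i proportional to rho**i
        rho = p / (1.0 - p)
        weights = [rho ** i for i in range(n + 1)]
        return sum(i * w for i, w in enumerate(weights)) / sum(weights)

    # As in Figure 13.5(a): E_n stays below 1 whenever p < 1/3 and blows up as p nears 0.5.
    for p in (0.30, 0.3334, 0.415, 0.4575):
        print(p, round(expected_value(p, 16), 3))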
In this section, we provided the analysis of AMoS performance on a simple
network model. Yet, we still need to evaluate the performance of AMoS in more re-
alistic conditions, since the performance measurements of AMoS in these conditions
may differ from the examined model. These conditions involve additional parameters,
such as the network topology, realistic link-load functions, the node distribution in the network,
and more. The next section deals with this problem, and presents the simulation of
AMoS in various networks with different properties.
The theoretical analysis of the previous section indicates that AMoS will perform
very well in terms of the number of control messages used. However, one needs to
verify that the same holds even without the simplifying assumptions we made, and
in realistic working configurations. To do so, we examine in this section the
performance of AMoS using simulation. Our measurements indicate that in realistic
network conditions AMoS outperforms all other algorithms when network utiliza-
tion is up to 95%. Beyond this point there are too many local alerts, and a simpler
algorithm (i.e., Probing or Polling) would perform better.
Fig. 13.5 The number of control messages used by the different monitoring algorithms: analytical estimation and single-flow simulation. (a) Analytical estimation of the expected number of control messages per link as a function of the flow length, for various probabilities p. (b) The number of control messages used by the different monitoring algorithms as a function of the load, n = 16

We first consider a single flow over a path of length n (where n varies from 4 to
20), and we set the distance to the CE to be one hop from all nodes. (In
Section 13.3.2.2 we run simulations on real and simulated ISP networks, where the CE
is placed in a real network node and we consider the real network distances.) In
order to simulate traffic load we used a Pareto distribution, and to normalize all the
simulations we set the local threshold value to 1. In the simulation,
we varied the mean and variance of the Pareto distribution from which the load
values are derived. Clearly, as the mean approaches 1, the probability of exceeding
a local threshold increases, and the number of control messages used by AMoS will
increase. We also expected the variance to have a similar effect, since when
the variance increases, the probability of exceeding a local threshold increases; however,
as indicated by our results, this effect is very mild and in fact AMoS is not sensitive
to the variance.
In order to evaluate the performance of AMoS, we need to compare the number
of control messages used by AMoS with common practice and best-known algo-
rithms. This is done by comparing it to the four algorithms described in Section
13.2. For each of these algorithms we simulated the exact flow of control (monitor-
ing) packets, and computed the cost in terms of packets times hops. For the reactive
algorithms, we used the same local threshold as for AMoS (i.e., one).
Figure 13.5(b) depicts the average number of control messages used by our algo-
rithm and by the four basic monitoring algorithms for different values of the mean
of the Pareto Distribution. Each point in this figure is the average of 500 runs, done
on flows of length 16. One can observe that in this region of load, the Polling algorithms outperform the Probing algorithms, and the reactive algorithms do not scale
well as load increases. This is due to the fact that even when the load is not very
high (a mean of 0.7), if the path is long enough there is a non-negligible
probability that at least one node will exceed its local threshold. When the load is
lower, the reactive monitoring algorithms outperform the oblivious ones.
AMoS, however, scales very well when load increases. One can see that even
when the mean of the Pareto distribution is 0.9, it requires fewer messages than
Polling. Figure 13.5(c) depicts the same information for a shorter flow of length 6.
As expected, the reactive techniques behave much better on short flows, since the
probability that at least one link exceeds the threshold drops. The performance of
AMoS also depends on the path length, yet we can see that on a flow path of length
6, AMoS outperforms all other algorithms.
The most interesting aspect of the graphs presented in Figures 13.5(b) and
13.5(c) is probably the crossover points. These are the points at which the cost of AMoS
equals that of the next best algorithm, in most cases Polling. We extracted these points
for the extensive set of simulations we ran. The results are depicted in Figure 13.5(d).
One can see that for flow length 6, AMoS is superior for every examined
mean, and for flow length 8, AMoS is better than the other algorithms up to a mean of
about 0.96. This value drops gradually as the flow length increases to 20. One
can also observe that these points go down when the variance changes from 0.2 to
0.3, but again the drop is very mild, indicating again that AMoS scales well with
variance.
Most prior work on modeling the Internet topology (see, e.g., [34]) deals with the AS structure and not with the router-level topology. Recently,
Aiello et al. proposed a technique for creating Power Law Random Graphs (PLRG)
[2] that produces more realistic large-scale topologies.
[34]) examined the properties of the topologies created by different types of gener-
ators: random graph generators (see [6, 36]), structural generators that focus on the
hierarchical nature of the Internet, and PLRG generators. The authors report that
the degree-based generators (such as PLRG) create the most realistic models of the
Internet. These topology generators assign degrees to the topology nodes and then
uniformly select pairs to form the links in the topology.
Spring et al. created an ISP topology mapping engine called Rocketfuel [33].
They recorded several maps of well-known ISPs, such as AT&T, Sprintlink, and
EBONE, and performed a thorough analysis of large-scale topologies. They con-
cluded that router degrees in the network are distributed mostly according to a Weibull
distribution. In our simulations, we used an AT&T map produced by Rocketfuel for
the underlying topology [32].
Another important aspect is the distribution of the flows’ endpoints in the net-
work. For person to person phone calls, one can assume that the endpoints are
distributed uniformly in the network. However, in many cases, for a specific ISP
the endpoints are the connecting points to other ISPs, and thus most traffic concentrates on relatively few points. In such a case the links that are close to these points
will be used by many flows. Another factor that affects the number of flows that
use a given link is the overall number of QoS enabled flows in the network. While
in the current Internet this number may be small, our algorithm should be able to
handle a future situation in which this number will be very large when compared to
the network size (i.e., the number of nodes or links).
In order to evaluate the performance of AMoS against the other monitoring
techniques in the most general setting, we simulated the number of messages times
hops they use over a large range of parameters. The network topology we used was,
as mentioned before, either a 9,000-node topology of the AT&T network from [33],
or a synthetic PLRG graph with 10,000 nodes and Weibull-distributed degrees. We
distributed the source points of the flows either uniformly or using the Zipf distribution, and varied the overall number of flows between 500 and 50,000. The endpoints of the flows were distributed uniformly. For each combination of the above, we ran
our simulation with different load values as in the single-flow case; this was done
by choosing different mean and variance values for the Pareto distribution. Each
of these simulations was averaged over at least 10 separate runs using the same
parameters.
Figure 13.6(a) presents the number of control messages used by the different
monitoring algorithms as a function of the mean of the Pareto Distribution used to
generate the load. As expected, the oblivious monitoring techniques are not affected
by the load, while both the reactive and active techniques generate more monitoring
traffic as the average load increases. One can see, however, that there are little fluc-
tuations in the lines of Probes and Polls. These fluctuations are probably the result
of the large variance of the control traffic in the Zipf to Uniform model, where the
distance of the flow endpoints from the CE changes from one simulation to another.
Fig. 13.6 The number of control messages used by the different monitoring algorithms as a function of the load. (a) Real ISP topology, 10k flows, Zipf to Uniform. (b) Real ISP topology, 10k flows, Uniform to Uniform
Nevertheless, while the traffic of both reactive polls and reactive probes increases
linearly with the mean, that of AMoS increases more slowly for small values of the mean.
Then, due to the fact that many probes can be sent in parallel, the number of control
messages increases dramatically. Still, AMoS outperforms all other algorithms.
The performance of the algorithms depends, of course, on several other parame-
ters such as the number of flows and the flows’ endpoint distribution. Surprisingly
enough, the endpoint distribution has very little effect on the performance. If we
compare the performance of various algorithms in Figure 13.6(a), where one end-
point of each flow is chosen according to Zipf Distribution, to Figure 13.6(b), where
both endpoints of the flows are chosen uniformly, we see that there is only a very
mild difference.
We also examine the difference between the two endpoint distributions when the
overall number of flows is much larger (50k in our case); we can observe, in
Figures 13.6(c) and (d), that the performance is not much affected by the load.
In these simulations, all control messages sent by the algorithms hold information
regarding a single flow. However, for the polling algorithms, a natural approach is to
send one control message that contains information about all flows that go through
the node. We term such use of a single control message carrying information about
a number of flows message sharing.
In this part of the chapter we described a novel distributed self-managed al-
gorithm that monitors end-to-end delay of flows using the in-network monitoring
paradigm. We have shown, using theoretical analysis and extensive simulation study,
that in addition to dramatically reducing the load on the CE, our autonomous mon-
itoring algorithm uses a relatively small amount of network traffic, and it scales very
well when the number of flows and the load increase.
This section presents GAP (Generic Aggregation Protocol), a protocol for continu-
ous monitoring of network-wide aggregates [9]. GAP is an asynchronous distributed
protocol that builds and maintains a BFS (Breadth First Search) spanning tree on an
overlay network. The tree is maintained in a similar way as the algorithm that under-
lies the 802.1d Spanning Tree Protocol (STP). In GAP, each node holds information
about its children in the BFS tree, in order to compute the partial aggregate, i.e., the
aggregate value of the local variables from all nodes of the subtree where this node
is the root. GAP is event-driven in the sense that messages are exchanged as results
of events, such as the detection of a new neighbor on the overlay, the failure of a
neighbor, an aggregate update, or a change in the local variable.
A common approach to computing aggregates in a distributed fashion involves
creating and maintaining a spanning tree (within the monitoring layer) and aggre-
gating state information along that tree, bottom-up from the leaves towards the root
(e.g., [10, 20, 26, 30]). GAP is an example of such a tree-based protocol. It builds its
spanning tree, also called the aggregation tree, in a decentralized, self-stabilizing manner, which provides the monitoring protocol with robustness properties. Figure 13.7
shows an example of such an aggregation tree.

Fig. 13.7 Aggregation tree with aggregation function SUM. The physical nodes in the figure refer to the NEs in Section 13.2; the management station refers to the CE component in Section 13.2
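As a minimal illustration of the bottom-up computation (our own sketch, not GAP itself; all names are ours), a partial aggregate such as SUM can be computed recursively along the tree:

    def partial_aggregate(node, children, local_weight, combine=sum):
        # Aggregate of the subtree rooted at `node`: the node's local variable
        # combined with the partial aggregates reported by its children.
        child_values = [partial_aggregate(c, children, local_weight, combine)
                        for c in children.get(node, [])]
        return combine([local_weight[node]] + child_values)

    # Example: root has children a and b, and a has child c; the SUM reaches the root.
    children = {"root": ["a", "b"], "a": ["c"]}
    local_weight = {"root": 1, "a": 2, "b": 3, "c": 4}
    assert partial_aggregate("root", children, local_weight) == 10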
A second, less-studied approach to computing aggregates involves the use of
gossip protocols, which typically rely on randomized communication to disseminate
and process state information in a network (e.g., [4, 16, 17, 37]). This approach will
not be further discussed here.
GAP assumes a distributed management architecture, whereby each network
device (referred to as NE in Section 13.2) participates in the monitoring task by run-
ning a management process, either internally or on an external, associated device.
These management processes communicate via a management overlay network for
the purpose of monitoring. We also refer to this overlay as the network graph. The
topology of the overlay can be chosen independently from the topology of the un-
derlying physical network. The aggregation tree shown in Figure 13.7 spans the
management overlay. Each management process contains a leaf node and an ag-
gregating node of this tree. A management station (referred to as CE in Section
13.2) can use any network device as access point for monitoring and invoke the
protocol.
In GAP each node maintains a neighborhood table T , such as the one pictured in
Table 13.1, containing an entry for itself and each of its neighbors on the network
graph.
In stable state the table will contain an entry for each live neighbor containing
its identity, its status vis-a-vis the current node (self, parent, child, or peer), its level
in the BFS tree (i.e., distance to root) as a non-negative integer, and its aggregate
weight (i.e., the aggregate weight of the spanning tree rooted in that particular node).
The exception is self. In that case, the weight field will contain only the weight of
the local node.
Initially, the neighborhood table of all nodes n except the root contains a single
entry (n, self, l0, w0), where l0 and w0 are some initial level and weight, respectively. The initial
level must be a non-negative integer. The initial neighborhood table of the root contains, in addition, the entry (n_root, parent, −1, w_root), where n_root is a “virtual root”
node id used to receive output and w_root is arbitrary. This virtual root convention
ensures that the same code can be used for the root as for other nodes, unlike [12]
where the root is “hardwired” in order to ensure self-stabilization.
The protocol executes using asynchronous message passing. The execution model
assumes a set of underlying services – including failure detection, neighbor discovery, local weight update, message delivery, and timeout – that deliver their output to
the process by enqueuing messages of the form (tag, Arg1, ..., Argn). The following
five message types are considered:
(fail, n) is delivered upon detecting the failure of node n.
(new, n) reports detection of a new neighbor n. At time of initialization, the list
of known neighbors is empty, so the first thing done by the protocol after initial-
ization will include reporting the initial neighbors.
(update, n, w, l, p) is the main message, called an update vector, exchanged be-
tween neighbors. This message tells the receiving node that the BFS tree rooted
in sending node n has aggregate weight w and that n has the level and parent
specified. This message is computed in the obvious way from n’s neighborhood
table using the operation updatevector(T). Observe that the parent field of the
update vector is defined only when n’s neighborhood table has more than one
entry.
(weight, w) is delivered as a result of sampling the local weight. The frequency
and precision with which this takes place is not further specified.
(timeout) is delivered upon a timeout.
The main loop of the algorithm is given in pseudocode in Figure 13.8. Each loop
iteration consists of three phases: processing one received message, restoring the
neighborhood-table invariant, and computing a new update vector that is sent to a
newly detected neighbor and, on a timeout, broadcast to all neighbors if it has changed:
proc gap() =
    ... initialize data structures and services ...
    Timeout = 0 ;
    NewNode = null ;
    Vector = updatevector() ;
    ... main loop ...
    while true do
        % Phase 1: receive and process one message
        receive
          {new, From}                           => NewNode = newentry(From)
        | {fail, From}                          => removeentry(From)
        | {update, From, Weight, Level, Parent} => updateentry(From, Weight, Level, Parent)
        | {updatelocal, Weight}                 => updateentry(self(), Weight, level(self()), parent())
        | {timeout}                             => Timeout = 1
        end ;
        % Phase 2: restore the neighborhood-table invariant
        restoreTableInvariant() ;
        % Phase 3: compute the new update vector and send it where needed
        NewVector = updatevector() ;
        if NewNode != null
            { send(NewNode, NewVector); NewNode = null } ;
        if NewVector != Vector && Timeout
            { broadcast(NewVector); Vector = NewVector; Timeout = 0 }
    od ;

Fig. 13.8 The main loop of GAP
This section outlines two refinements of the GAP protocol, both of which aim at
reducing protocol overhead while maintaining certain objectives. The first such
refinement, named A-GAP, employs a local filter scheme, whereby a node drops
updates when only small changes to its partial aggregate occur. Like the GAP proto-
col, A-GAP performs continuous monitoring of aggregates, but aims at minimizing
protocol overhead while adhering to a configurable accuracy objective. The second
refinement of GAP, called TCA-GAP, detects threshold crossings of aggregates.
It applies the concepts of local thresholds and local hysteresis, aimed at reducing
protocol overhead whenever the aggregate is “far” from the given threshold while
ensuring correct detection.
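The local filter scheme can be sketched as follows (our own illustration; the class and variable names are not from A-GAP). A node reports its partial aggregate to its parent only when the value has moved by more than its filter width since the last report:

    class LocalFilter:
        def __init__(self, filter_width):
            self.filter_width = filter_width    # F_n, assigned by the heuristic described below
            self.last_reported = None

        def maybe_report(self, partial_aggregate, send_update):
            changed_enough = (self.last_reported is None or
                              abs(partial_aggregate - self.last_reported) > self.filter_width)
            if changed_enough:
                send_update(partial_aggregate)
                self.last_reported = partial_aggregate
            return changed_enough               # False: the update is dropped (filtered)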
Both protocols, A-GAP and TCA-GAP, operate in an asynchronous and decen-
tralized fashion. They inherit from GAP the functionality of creating and main-
taining the aggregation tree (specifically, handling node arrivals, departures, and
failures) and that of incremental aggregation. A thorough presentation of A-GAP
and TCA-GAP can be found in [26] and [38], respectively.
Estimating the aggregate at the root node with minimal overhead for a given accuracy can be formalized as an optimization problem. Let n be a node in the network
graph, ω_n the rate of updates received by node n from its children, F_n the filter
width of node n, E_root the distribution of the estimation error at the root node, and
ε the accuracy objective. The problem can then be stated as: minimize max_n {ω_n}
subject to E[|E_root|] ≤ ε, where ω_n and E_root depend on the filter widths (F_n)_n, which
are the decision variables.
We developed a stochastic model for the monitoring process, which is based on
discrete-time Markov chains and describes individual nodes in their steady state
[26]. For each node n, the model relates the error E_n of the partial aggregate of n,
the step sizes that indicate changes in the partial aggregate, the rate of updates n
sends, and the filter width F_n. In a leaf node, the change of the local variable over
time is modeled as a random walk. The stochastic model permits us to compute the
distribution E_root of the estimation error at the root node and the rate of updates ω_n
processed by each node.
To solve the optimization problem, A-GAP employs a distributed heuristic,
which maps the global problem into a local problem that each node solves in an
asynchronous fashion. This way, each node periodically computes the local filters
and (local) accuracy objectives for its children. A-GAP continuously estimates the
step sizes in the leaf nodes for the random-walk model using the maximum like-
lihood estimator (MLE). Note that these step sizes are the only variables that the
protocol estimates. All other variables are dynamically computed based on these
estimates.
We evaluated A-GAP through extensive simulations and present here results from
only two scenarios, related to (a) controlling the trade-off between protocol overhead and
estimation error and (b) real-time estimation of the error distribution. Both scenarios
share the following settings. The management overlay follows the physical topology
of Abovenet, an ISP, with 654 nodes and 1332 links. Link speeds in the overlay are
100 Mbps. The communication delay is 4 ms, and the time to process a message at a
node is 1 ms. The local management variable represents the number of HTTP flows
entering the network at a given node, and thus the monitored aggregate is the current
number of HTTP flows in the network. (In the Abovenet scenarios, the aggregate is
on the order of 20,000 flows.) The local variables are updated asynchronously, once
every second. The evolution of the local variables is simulated based on packet
traces that were captured at the University of Twente at two of their network access
points and then processed by us to obtain traces for all nodes in the simulation [26].
Figure 13.9 gives a result from the first scenario and shows the protocol overhead
(i.e., the maximum number of processed updates across all nodes) as a function of
the experienced error. Every point in the graph corresponds to a simulation run.
We observe that the overhead decreases monotonically as the estimation error in-
creases. Consequently, the overhead can be reduced by allowing a larger estimation
error, and the error objective is an effective control parameter. For example, com-
pared to an error objective of 0 (which results in an experienced error of 4.5), an
error objective of 2 flows (experienced error 5) reduces the load by 30%; an error
objective of 20 flows (experienced error 21) leads to an 85% reduction in load.
Figure 13.10 relates to the second scenario and shows the predicted error dis-
tribution computed by A-GAP and the actual error measured in a simulation run,
for an error objective of 8. The vertical bars indicate the average actual error. As
one can see, the predicted error distribution is close to the actual distribution. More
importantly, the distributions have long tails. While the average error in this mea-
surement period is 8.76, the maximum error during the simulation run is 44 and
the maximum possible error (that would occur in an infinite measurement period)
is 70.

Fig. 13.9 Protocol overhead incurred by A-GAP as a function of the experienced error e

Fig. 13.10 Distribution of the error predicted by A-GAP and the actual error at the root node
significant for practical scenarios than a maximum error objective used by other
authors [10, 22, 30]. We have implemented A-GAP and deployed it on a testbed of
16 commercial routers where it is used for monitoring IP flows [25]. The testbed
measurements are consistent with the simulation studies we performed for different
topologies and network sizes, which proves the feasibility of the protocol design,
and, more generally, the feasibility of effective and efficient real-time flow monitor-
ing in large network environments.
Most current research in monitoring aggregates is carried out in the context of wire-
less sensor networks, where energy constraints are paramount and the objective is
to maximize the lifetime of the network. Further, many recent works on monitor-
ing the evolution of aggregates over time focus on n-time queries that estimate the
aggregate at discrete times and are realized as periodic snapshots (e.g., [10, 23, 30]).
The trade-off between accuracy and overhead for continuous monitoring of
aggregates was first studied by Olston et al., who proposed a centralized monitoring protocol to control this trade-off [22, 23].
The main differentiator between A-GAP and related protocols is its stochastic
model of the monitoring process. This model allows for a prediction of the protocol
performance, in terms of overhead and error, and the support of flexible error objec-
tives. In fact, all protocols known to us that allow controlling the trade-off between
accuracy and overhead can support only the maximum error as accuracy objective,
which, as we and others pointed out, is of limited practical relevance.
A key idea in TCA-GAP is to introduce and maintain local thresholds that apply to
each node in the aggregation tree. These local thresholds allow a node to switch between an active state, where the node executes the GAP protocol and sends updates
of its partial aggregate to its parent, and a passive state, where the node ceases to
propagate updates up the aggregation tree. The transition between the active and passive
states is controlled by a local threshold and a local hysteresis mechanism.
We restrict the discussion here to the case where the crossing of the upper global
threshold is detected. (Detecting a downward crossing of the lower threshold T_g
can be achieved in a very similar way [38].) For reasons of readability, we omit some
of the formal details. The first condition under which node i might need to switch to
the active state is local and is captured by the rule
(R1)  w_i + Σ_{j∈J} T_j ≤ T_i ,  where w_i is the local weight of node i, T_i its local threshold, and J the set of its children.
The second condition under which node i might need to switch to active state
concerns the situation where one or more of its children are active. Recall that the
local hysteresis mechanism ensures that the actual aggregate of a subtree rooted in
a passive child j does not exceed Tj (at least in the approximate sense as computed
by the underlying aggregation protocol, GAP). Thus, a sufficient condition for the
actual aggregate of i ’s subtree to not exceed Ti is that the sum of aggregates reported
by active children does not exceed the sum of the corresponding local thresholds.
This motivates the second local threshold rule:
(R2)  Σ_{j∈J'} a_j(t) ≤ Σ_{j∈J'} T_j ,  where J' is the set of active children of node i.
Rules R1 and R2 together ensure that local threshold crossings will be detected
[38].
If one of the rules R1 or R2 fails on node i, then the node attempts to reinstate
the rule by reducing the threshold of one or more passive children. We call this procedure threshold recomputation. Specifically, if (R1) fails, then the protocol reduces
the thresholds of one or more passive children by a total of
δ = w_i + Σ_{j∈J} T_j − T_i ,  where J is the set of children of i.
Evidently, this may cause one or more passive children to become active.
If (R2) fails, then the protocol reduces the thresholds of one or more passive
children by a total of δ > Σ_{j∈J'} (a_j − T_j), where J' is the set of active children, and, at
the same time, increases the assigned thresholds of one or more active children by
the same amount, which will reinstate (R2). Such a reduction is always possible
since the node is passive.
There are many possible policies for threshold recomputation. For instance, there
are several ways to choose the set of active children whose thresholds are increased.
Note though that the amount of the threshold increment for child j must not exceed
a_j/k_2 − T_j. If it does, there exists a scenario in which two children alternately borrow
threshold space from each other and the system oscillates. In TCA-GAP the protocol
identifies the smallest set Ĵ of active children with the largest values of a_j/k_1 − T_j
such that Σ_{j∈Ĵ} (a_j/k_1 − T_j) > Σ_{j∈J'} (a_j − T_j). Then δ is chosen such
that δ = Σ_{j∈Ĵ} (a_j/k_1 − T_j), and the threshold of child j is increased by a_j/k_2 − T_j
for all j ∈ Ĵ.
There are also options for how to choose the set of passive children whose thresholds are reduced. Here is a policy that is also used in the simulation results below: the
child j with the largest threshold T_j is selected. If δ ≤ T_j, then j is the only child
whose threshold is reduced. Otherwise, T_j is reduced to 0, and this procedure is
applied to the child with the second largest threshold with δ := δ − T_j, and so on. This policy
attempts to minimize the overhead for threshold updating at the cost of increasing
the risk of nodes becoming active.
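This reduction policy can be expressed compactly as the following Python sketch (our own; the function and variable names are not taken from TCA-GAP):

    def reduce_passive_thresholds(delta, thresholds):
        # Reduce the thresholds of passive children by a total of `delta`,
        # always taking from the child with the currently largest threshold.
        # `thresholds` maps child id -> local threshold T_j.
        updated = dict(thresholds)
        for child in sorted(updated, key=updated.get, reverse=True):
            if delta <= 0:
                break
            take = min(delta, updated[child])   # if delta <= T_j, only this child is touched
            updated[child] -= take
            delta -= take
        return updated                          # children whose threshold dropped may turn active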
We illustrate the efficiency of TCA-GAP for a scenario with a setup very similar
to the one given in Section 13.5.1.3, with the major exception that the traces of the
local variables in this scenario are obtained by adding a sinusoidal bias to the traces
in Section 13.5.1.3 in order to create threshold crossings. (For a detailed description
of the scenario and results from a thorough simulation study, see [38].)
Figure 13.12 shows the change of the aggregate and the protocol overhead over
time during a simulation run of 45 s. Three threshold crossings occur: at around t =
9 s (upper threshold crossing), t = 23.5 s (lower threshold crossing) and t = 39 s
(upper threshold crossing). Before each threshold crossing, e.g., between t = 8 s
Acknowledgements This work has been conducted as part of the EU FP7 Project 4WARD on
Future Internet design [1].
References
24. G. Pavlou. On the evolution of management approaches, frameworks and protocols: A histori-
cal perspective. J. Netw. Syst. Manage., 15(4):425–445, 2007.
25. A. Gonzalez Prieto and R. Stadler. Monitoring flow aggregates with controllable accuracy. In
10th IFIP/IEEE International Conference on Management of Multimedia and Mobile Networks
and Services (MMNS 2007), San José, California, USA, Oct 31 – Nov 2, 2007.
26. A. G. Prieto and R. Stadler. A-GAP: An adaptive protocol for continuous network monitoring
with accuracy objectives. IEEE Transactions on Network and Service Management, 4(1):
2–12, June 2007.
27. D. Raz and Y. Shavitt. Active networks for efficient distributed network management. Com-
munications Magazine, IEEE, 38(3):138–143, Mar 2000.
28. K. Salamatian and S. Fdida. Measurement based modeling of quality of service in the inter-
net: A methodological approach. In IWDC ’01: Proceedings of the Thyrrhenian International
Workshop on Digital Communications, pages 158–174, London, UK, 2001. Springer-Verlag.
29. H. Schulzrinne, A. Rao, and R. Lanphier. Real time streaming protocol (RTSP), RFC 2326,
1998.
30. M. A. Sharaf, J. Beaver, A. Labrinidis, and P. K. Chrysanthis. Balancing energy efficiency and
quality of aggregate data in sensor networks. ACM International Journal on Very Large Data
Bases, 13(4):384–403, 2004.
31. I. Sharfman, A. Schuster, and D. Keren. A geometric approach to monitoring
threshold functions over distributed data streams. In SIGMOD ’06: Proceedings of the 2006
ACM SIGMOD international conference on Management of data, pages 301–312. ACM Press,
2006.
32. N. Spring, R. Mahajan, and D. Wetherall. Rocketfuel maps and data. http://www.cs.
washington.edu/research/networking/rocketfuel/.
33. N. Spring, R. Mahajan, and D. Wetherall. Measuring ISP topologies with rocketfuel. In Pro-
ceedings of ACM/SIGCOMM ’02, August 2002.
34. H. Tangmunarunkit, R. Govindan, S. Jamin, S. Shenker, and W. Willinger. Network topology
generators: Degree-based vs structural. In ACM SIGCOMM, August, 2002.
35. D. L. Tennenhouse and D. J. Wetherall. Towards an active network architecture. SIG-
COMM Comput. Commun. Rev., 26(2):5–17, 1996.
36. B. M. Waxman. Routing of multipoint connections. IEEE Journal of Selected Areas in Com-
munications, 6(9):1617–1622, December 1988.
37. F. Wuhib, M. Dam, R. Stadler, and A. Clemm. Robust monitoring of network-wide aggregates
through gossiping. IEEE Transactions on Network and Service Management, 6(2),
June 2009.
38. F. Wuhib, M. Dam, and R. Stadler. Decentralized detection of global threshold crossings using
aggregation trees. Computer Networks, 52(9):1745–1761, February 2008.
39. X. Xiao and L. M. Ni. Internet QoS: A big picture. IEEE Network, 13(2):8–18, March 1999.
Chapter 14
Algebraic Approaches for Scalable End-to-End
Monitoring and Diagnosis
Abstract The rigidity of the Internet architecture has led to a flourishing of research
on end-to-end based systems. In this chapter, we describe a linear algebra-based
end-to-end monitoring and diagnosis system. We first propose a tomography-based
overlay monitoring system (TOM). Given n end hosts, TOM selectively monitors
a basis set of O(n log n) paths out of all n(n − 1) end-to-end paths. Any end-to-end
path can be written as a unique linear combination of paths in the basis set.
Consequently, by monitoring loss rates for the paths in the basis set, TOM infers loss
rates for all end-to-end paths. Furthermore, leveraging the scalable measurements
from the TOM system, we propose the Least-biased End-to-End Network Diagnosis
(in short, LEND) system. We define a minimal identifiable link sequence (MILS)
as a link sequence of minimal length whose properties can be uniquely identified
from end-to-end measurements. LEND applies an algebraic approach to find the
MILSes and infer their properties efficiently. This also means that the LEND
system achieves the finest diagnosis granularity under the least biased statistical
assumptions.
14.1 Introduction
“When something breaks in the Internet, the Internet’s very decentralized structure makes
it hard to figure out what went wrong and even harder to assign responsibility.”
– “Looking Over the Fence at Networks: A Neighbor’s View of Networking Research”, by
Committees on Research Horizons in Networking, US National Research Council, 2001.
The rigidity of the Internet architecture makes it extremely difficult to deploy in-
novative disruptive technologies in the core. This has led to extensive research into
overlay and peer-to-peer systems, such as overlay routing and location, application-
level multicast, and peer-to-peer file sharing. These systems flexibly choose their
communication paths and targets, and thus can benefit from estimation of end-to-end
network distances (e.g., latency and loss rate). Accurate loss rate monitoring systems
can detect path outages and periods of degraded performance within seconds. They
facilitate management of distributed systems such as virtual private networks (VPN)
and content distribution networks; and they are useful for building adaptive overlay
applications, like streaming media and multiplayer gaming.
Meanwhile, Internet fault diagnosis is important to end users, overlay network
service providers (like Akamai [2]), and Internet service providers (ISPs). For exam-
ple, with Internet fault diagnosis tools, users can choose more reliable ISPs. Overlay
service providers can use such tools to locate faults in order to fix them or bypass
them; information about faults can also guide decisions about service provisioning,
deployment, and redirection. However, the modern Internet is heterogeneous and
largely unregulated, which renders link-level fault diagnosis an increasingly chal-
lenging problem. The servers and routers in the network core are usually operated by
businesses, and those businesses may be unwilling or unable to cooperate in collect-
ing the network traffic measurements vital for Internet fault diagnosis. Therefore,
end-to-end diagnosis approaches attract most of the focus of researchers in this area.
Thus it is desirable to have an end-to-end loss rate monitoring and
diagnosis system which is both accurate and scalable. We formulate the problem as follows: consider an overlay network of n end hosts; we define a path to be a routing
path between a pair of end hosts, and a link to be an IP link between routers. A path
is a concatenation of links. We also rely on the two fundamental statistical assumptions underlying any end-to-end network monitoring and diagnosis approach:
– End-to-end measurement can infer the end-to-end properties accurately.
– The linear system relating path- and link-level properties assumes independence
between link-level properties.
Therefore, the monitoring problem we focus on is to select a minimal subset of paths
from the O(n^2) paths to monitor so that the loss rates and latencies of all other paths
can be inferred. Also, we aim to use only the above basic assumptions to achieve
the least biased and, hence, the most accurate, diagnosis based on the measurement
results of the paths. We define a minimal identifiable link sequence (MILS) as a
link sequence of minimal length whose properties can be uniquely identified from
end-to-end measurements without bias.
In this chapter, we describe a linear algebra-based end-to-end monitoring and
diagnosis system. Specifically, the monitoring system we propose is a tomography-
based overlay monitoring system (TOM) [9] in which we selectively monitor a basis
set of k paths. Any end-to-end path can be written as a unique linear combination
of paths in the basis set. Consequently, by monitoring loss rates for the paths in
the basis set, we infer loss rates for all end-to-end paths. This can also be extended
to other additive metrics, such as latency. The end-to-end path loss rates can be
computed even when the paths contain unidentifiable links for which loss rates can-
not be computed. Furthermore, based on the measurements from the TOM system,
we propose the Least-biased End-to-end Network Diagnosis (LEND) system [34],
which applies an algebraic approach to find out the MILSes and infers the proper-
ties of the MILSes. This also means our LEND system achieves the finest diagnosis
granularity under the least biased case.
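To make the basis-set idea concrete, the following is a minimal numpy sketch (our own illustration, not the TOM implementation; the routing matrix G, the greedy rank test, and all names are assumptions). Rows of G are paths, columns are links, and the monitored metric is additive, e.g., x_j = −log(1 − loss_j) per link:

    import numpy as np

    def select_basis_paths(G):
        # Greedily pick a maximal set of linearly independent rows (paths) of G.
        basis, rank = [], 0
        for i in range(G.shape[0]):
            if np.linalg.matrix_rank(G[basis + [i], :]) > rank:
                basis.append(i)
                rank += 1
        return basis

    def infer_path_metric(G, basis, measured, path_row):
        # Express the path as a linear combination of the basis paths
        # (solvable because the basis spans the row space of G), then
        # combine the measured basis-path metrics with the same coefficients.
        c, *_ = np.linalg.lstsq(G[basis, :].T, path_row, rcond=None)
        return float(c @ measured)

    # Toy example: 4 paths over 3 links; the 4th path lies in the span of the
    # first three, so measuring 3 paths suffices to infer all 4.
    G = np.array([[1, 1, 0],
                  [0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 1]], dtype=float)
    basis = select_basis_paths(G)             # -> [0, 1, 2]
    x = np.array([0.01, 0.05, 0.02])          # per-link -log(1 - loss rate)
    measured = G[basis] @ x                   # what would actually be measured
    print(infer_path_metric(G, basis, measured, G[3]))   # ~0.08 = 0.01 + 0.05 + 0.02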
We describe the following properties of the TOM and LEND systems:
For reasonably large n (say 100), the basis path set has k = O(n log n), as shown through
linear regression tests on various synthetic and real topologies. We also provide
some explanation based on the Internet topology and the AS hierarchy, which
shows that the monitoring system TOM is scalable.
We advocate the unbiased end-to-end diagnosis paradigm and introduce the
concept of MILS. However, taking a network as a directed graph, when only
topology information is used, we prove that each path is an MILS: no path
segment smaller than an end-to-end path has properties which can be uniquely
determined by end-to-end measurements. To address the problem, we observe
that, in practice, there are many good paths with zero loss rates. Then, as a fact
rather than a statistical assumption, we know that all the links on such paths must
also have no losses. Based on this observation, we propose a “good path” al-
gorithm, which uses both topology and measurement snapshots to find MILSes
with the finest granularity.
In an overlay network, end hosts frequently join/leave the overlay and routing
changes occur from time to time. For both TOM and LEND systems to adapt to
these efficiently, we design incremental algorithms for path addition and deletion
with small cost instead of reinitializing the system with high cost. We propose
randomized schemes for measurement load balancing as well. The details of
these algorithms can be found in [9, 34].
The TOM and LEND systems have been evaluated through extensive simulations
and Internet experiments. In both simulations and PlanetLab experiments, path loss
rates have been estimated with high accuracy using O(n log n) measurements, and
the lossy links are further diagnosed with fine granularity and accuracy. For the
PlanetLab experiments, the average absolute error of path loss rate estimation is
only 0.0027, and the average relative error rate is 1.1, even though about 10% of the
paths have incomplete or non-existent routing information. Also in the PlanetLab
experiments of LEND, the average diagnosis granularity is only four hops for all the
lossy paths. This can be further improved with larger overlay networks, as shown
through our simulation with a real router-level topology from [20]. In addition, the
loss rate inference on the MILSes is highly accurate, as verified through the cross-
validation and IP spoof-based validation schemes [34].
For the PlanetLab experiments with 135 hosts, the average setup (monitoring
path selection) time of the TOM system is 109.3 s, and the online diagnosis of
18,090 paths, 3,714 of which are lossy, takes only 4.2 s. In addition, we adapt to
topology changes within seconds without sacrificing accuracy. The measurement
load balancing reduces the load variation and the maximum vs. mean load ratio
significantly, by up to a factor of 7.3.
For the rest of the chapter, we first survey related work in the next section. Then
we describe the linear algebraic model and the system architecture in Section 14.3,
present the monitoring path selection (TOM) in Section 14.4 and the diagnosis algo-
rithms (LEND) in Section 14.5. Internet experiments are described in Section 14.6,
while simulations are omitted in this chapter. Finally, we discuss and conclude in
Section 14.7.
There are many existing scalable end-to-end latency estimation schemes, which
can be broadly classified into clustering-based [10, 18] and coordinate-based sys-
tems [23, 28]. Clustering-based systems cluster end hosts based on their network
proximity or latency similarity under normal conditions, then choose the centroid of
each cluster as the monitor. But a monitor and other members of the same cluster
often take different routes to remote hosts. So the monitor cannot detect conges-
tion for its members. Similarly, the coordinates assigned to each end host in the
coordinate-based approaches cannot embed any congestion/failure information.
Later on, the linear algebraic model was introduced into end-to-end monitoring, and
many monitoring system designs, including TOM, use mathematical and statistical
approaches to infer the whole network properties from measurements on carefully
selected path sets. Ozmutlu et al. select a minimal subset of paths to cover all links
for monitoring, assuming link-by-link latency is available via end-to-end measure-
ment [24]. But the link-by-link latency obtained from traceroute is often inaccurate.
And their approach does not work well for loss rate because it is difficult to estimate
link-by-link loss rates from end-to-end measurement. A similar approach was taken
for selecting paths to measure an overlay network [31]. The minimal set cover selected
can only give bounds for metrics like latency, and there is no guarantee as to how
far the bounds are from the real values. TOM selectively monitors a basis set of all
paths, and has stimulated several follow-up works. In [11], Chua et al. proposed
an SVD (Singular Value Decomposition) based solution, which selects fewer paths
than the basis of the path matrix, while all the unmeasured path properties can be
inferred without severe degradation of accuracy. More recently, Song et al. [30] in-
troduced the Bayesian experimental design framework into network measurement.
In [30], the best set of paths that achieves the highest expected estimation accuracy
is selected, given the constraint on the total number of selected paths.
Ping and traceroute are the earliest Internet diagnosis tools, and they are still widely
used. However, the asymmetry of Internet routing and of link properties makes it
difficult to use these tools to infer properties of individual links. The latest work on
network diagnosis can be put into two categories: pure end-to-end approaches [1, 5,
7, 12, 14, 16, 25] and router response-based approaches [3, 21, 33].
Most end-to-end tomography tools fall in one of two classes. First, several end-
to-end tomography designs are based on temporal correlations among multiple
receivers in a multicast-like environment [1, 5, 7, 12, 16]. Adams et al. [1] use a single multicast tree, and Bu et al. [5] extend Internet tomography to general topologies. Meanwhile, Duffield et al. [16] propose back-to-back unicast probing to mimic multicast and hence overcome the lack of multicast deployment in real networks.
Second, some other tools [14, 25] impose additional statistical assumptions beyond
the linear loss model described in Section 14.3.1.
Under certain assumptions, tools in the first class infer a loss rate for each virtual
link (i.e., sequence of consecutive links without a branching point) with high prob-
ability. Thus, these tools diagnose failures at the granularity of individual virtual
links; obviously, this is a bound on the granularity obtainable by the end-to-end to-
mography system. Typically these systems assume an ideal multicast environment;
but since true multicast does not exist in the Internet, they use unicast for approxi-
mation. Thus the accuracy of the probe measurements heavily depends on the cross
traffic in the network, and there is no guarantee of their accuracy.
As for the second class of tools, the statistically based tools introduced in [14, 25]
use only uncorrelated end-to-end measurements to identify lossy network links.
One shortcoming of these tools becomes apparent when studying the simple tree topology in
Figure 14.1. The numbers in the figure are the loss rates of the corresponding paths
or links. In this tree, we can only measure the loss rates of two paths: A → B
and A → C. Figures 14.1(a) and (b) show two possible link loss rate assignments that
lead to the same end-to-end path measurements. The linear programming approach
in [25] and SCFS [14] will always obtain the result of (a) because they are biased
toward minimizing the number of lossy link predictions; but such results may not be
correct. As for the random sampling and Gibbs sampling approaches in [25], either
(a) or (b) may be predicted. In fact, none of the loss rates of these three links is
identifiable from end-to-end measurements, and the LEND system will determine
that none of the individual links is identifiable, obtaining the MILSes A → N → B
and A → N → C.
Fig. 14.1 Example of an underconstrained system: (a) one possible link loss scenario; (b) a different link loss scenario with the same end-to-end path measurements
Other than the above two classes, Shavitt et al. use a linear algebraic algorithm to
compute some additional “distances” (i.e., latencies of path segments) that are not
explicitly measured [29]. The algorithm proposed in [29] serves the same purpose as
our link-level diagnosis algorithm in the undirected graph model. However, the LEND
system incorporates the scalable measurement approach designed in TOM [9] and
reuses its outputs to reduce the computational complexity of link-level diagnosis;
hence the LEND system is efficient in both measurement cost and computation.
More importantly, the Internet should be modeled as a directed graph, and the
algebraic algorithm in [29] cannot perform any link-level diagnosis on directed graphs,
as shown by Theorem 1 in Section 14.5.3.1.
In this section, we briefly describe the algebraic model and the system architecture
of the LEND system. The algebraic model is widely used in Internet tomography
and other measurement works [1, 9, 29]. For easy indexing, all the important notation used in this chapter can be found in Table 14.1.
Consider a path with end-to-end loss rate p over a network with s links, where link j has loss rate l_j and v_j = 1 if the path traverses link j (and v_j = 0 otherwise). Then

1 − p = ∏_{j=1}^{s} (1 − l_j)^{v_j}.  (14.1)
Equation (14.1) assumes that packet loss is independent among links. Caceres
et al. argue that the diversity of traffic and links makes large and long-lasting spatial
link loss dependence unlikely in a real network such as the Internet [6]. Further-
more, the introduction of Random Early Detection (RED) [17] policies in routers
will help break such dependence. In addition to [6], formula (14.1) has also been
proven useful in many other link/path loss inference works [5, 15, 25, 31]. Our In-
ternet experiments also show that the link loss dependence has little effect on the
accuracy of (14.1).
Let us take logarithms on both sides of (14.1). Then, by defining a column vector
x ∈ R^s with elements x_j = log(1 − l_j), and writing v^T for the transpose of the
column vector v, (14.1) can be rewritten as follows:

log(1 − p) = Σ_{j=1}^{s} v_j log(1 − l_j) = Σ_{j=1}^{s} v_j x_j = v^T x.  (14.2)
There are r = O(n²) paths in the overlay network, and thus there are r linear
equations of the form (14.2). Putting them together, we form a rectangular matrix
G ∈ {0, 1}^{r×s}. Each row of G represents a path in the network: G_{ij} = 1 when path
i contains link j, and G_{ij} = 0 otherwise. Let p_i be the end-to-end loss rate of the
i-th path, and let b ∈ R^r be a column vector with elements b_i = log(1 − p_i). Then
we write the r equations of form (14.2) as

Gx = b.  (14.3)
Normally, the number of paths r is much larger than the number of links s (see
Figure 14.2(a)). This suggests that we could select s paths to monitor, use those
measurements to compute the link loss rate variables x, and infer the loss rates of
the other paths from (14.3).
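To make the model concrete, the following small sketch (our own illustration, not code from [9] or [34]) builds the matrix G for a three-path overlay, applies the log transform of (14.2), and computes the path loss rates implied by assumed link loss rates; NumPy is used for the linear algebra.

import numpy as np

# Toy overlay of Figure 14.3: three paths over three links.
# Row i of G has a 1 in column j if path i traverses link j.
G = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]], dtype=float)

link_loss = np.array([0.02, 0.05, 0.01])   # assumed link loss rates l_j
x = np.log(1.0 - link_loss)                # x_j = log(1 - l_j)
b = G @ x                                  # b_i = log(1 - p_i), Eq. (14.3)
path_loss = 1.0 - np.exp(b)                # end-to-end path loss rates p_i
print(path_loss)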
However, in general, G is rank-deficient, i.e., k = rank(G) and k < s. If
G is rank-deficient, we will be unable to determine the loss rate of some links
from (14.3). These links are also called unidentifiable in network tomography
literature [5].
[Figure 14.2: the original r × s linear system Gx = b and the reduced k × s system Ḡx_G = b̄ formed from a basis set of k measured paths.]
[Figure 14.3: a sample overlay with three end hosts A, B, and C, three links, and three paths, giving G = [[1 1 0], [0 0 1], [1 1 1]]; the row (path) space of G is measured while the null space is unmeasured.]
Figure 14.3 illustrates how rank deficiency can occur. There are three end hosts
(A, B, and C) on the overlay, three links (1, 2, and 3) and three paths between the
end hosts. We cannot uniquely solve x1 and x2 because links 1 and 2 always appear
together. We know their sum, but not their difference.
Figure 14.3 illustrates the geometry of the linear system, with each variable x_i
as a dimension. The vectors {α·[1 −1 0]^T} comprise N(G), the null space of G.
No information about the loss rates for these vectors is given by (14.3). Meanwhile,
there is an orthogonal row (path) space of G, R(G^T), which for this example is the
plane {α·[1 1 0]^T + β·[0 0 1]^T}. Unlike the null space, the loss rate of any vector
in the row space can be uniquely determined by (14.3).
To separate the identifiable and unidentifiable components of x, we decompose
x into x = x_G + x_N, where x_G ∈ R(G^T) is its projection on the row space and
x_N ∈ N(G) is its projection on the null space (i.e., Gx_N = 0). The decomposition
of [x1 x2 x3]^T for the sample overlay is shown below:

x_G = ((x1 + x2)/2)·[1 1 0]^T + x3·[0 0 1]^T = [b1/2  b1/2  b2]^T,  (14.4)

x_N = ((x1 − x2)/2)·[1 −1 0]^T.  (14.5)
Thus the vector xG can be uniquely identified, and contains all the information
we can know from (14.3) and the path measurements. The intuition of our scheme
is illustrated through virtual links in [8].
Because x_G lies in the k-dimensional space R(G^T), only k independent equations
of the r equations in (14.3) are needed to uniquely identify x_G. We measure
these k paths to compute x_G. Since b = Gx = Gx_G + Gx_N = Gx_G, we can
compute all elements of b from x_G, and thus obtain the loss rates of all other paths.
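Continuing the same toy example, the sketch below (again our own illustration, with measurements synthesized from assumed link losses) computes x_G as the minimum-norm solution of the k measured equations and then infers the loss rates of every path, including the unmeasured one.

import numpy as np

G = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]], dtype=float)     # all r paths
basis_rows = [0, 1]                        # indices of the k monitored basis paths
G_bar = G[basis_rows]                      # G-bar: k x s

# Synthesize the measurements b-bar of the basis paths from assumed link losses.
x_true = np.log(1.0 - np.array([0.02, 0.05, 0.01]))
b_bar = G_bar @ x_true

# x_G is the minimum-norm solution of G-bar x_G = b-bar, i.e. the projection
# of x onto the row (path) space; the unidentifiable null-space part is dropped.
x_G = np.linalg.pinv(G_bar) @ b_bar

# Loss rates of all paths, measured or not, follow from b = G x_G.
print(1.0 - np.exp(G @ x_G))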
[Figure: system architecture, showing the overlay network operation center and the participating end hosts.]
Here we describe the basic static monitoring path selection algorithm of TOM;
incremental update and load balancing extensions are described in [9]. The basic
algorithms involve two steps. First, we select a basis set of k paths to monitor. Such
selection only needs to be done once at setup. Then, based on continuous monitoring
of the selected paths, we calculate and update the loss rates of all other paths.
To compute the path loss rates, we must find a solution to the underdetermined linear system

Ḡx_G = b̄,  (14.6)

where the vector b̄ comes from measurements of the selected paths. The basis paths are selected by the following procedure, which incrementally maintains a QR decomposition of the rows selected so far:
procedure SelectPath(G)
1  for every row (path) v in G do
2      R̂12 = R^{−T} Ḡ v^T = Q^T v^T
3      R̂22 = ‖v‖² − ‖R̂12‖²
4      if R̂22 ≠ 0 then
5          select v as a measurement path
6          update R ← [ R  R̂12 ; 0  √R̂22 ] and Ḡ ← [ Ḡ ; v ]
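For illustration, the following simplified variant of the selection step (our own sketch, not the incremental QR update used by SelectPath, and therefore slower) keeps a path only if it increases the rank of the set selected so far; it produces the same kind of basis set.

import numpy as np

def select_basis_paths(G):
    # Keep a path only if it is linearly independent of the paths chosen so far.
    selected, rank = [], 0
    for i in range(G.shape[0]):
        if np.linalg.matrix_rank(G[selected + [i]]) > rank:
            selected.append(i)
            rank += 1
    return selected

G = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]], dtype=float)
print(select_basis_paths(G))   # [0, 1]; the third path is linearly dependent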
In this section, we first give a formal definition of the MILS and introduce algorithms
to identify MILSes and infer their properties. For simplicity, we first study link
property inference for undirected graphs. We then turn to the more realistic problem
of inferring link properties in directed graphs.
As mentioned before, we know that not all the links (or the corresponding variables
in the algebraic model) are uniquely identifiable. Thus our purpose is to find the
smallest path segments with loss rates that can be uniquely identified through end-
to-end path measurements. We introduce the minimal identifiable link sequence (MILS)
to denote such path segments. A MILS can be as short as a single
physical link, or as long as an end-to-end path. Our methods are unbiased and work
with any network topology. This provides the first lower bound on the granularity at
which properties of path segments can be uniquely determined. With this informa-
tion, we can accurately locate which link (or set of links) causes congestion or
failures.
Figure 14.5 illustrates some examples for undirected graphs. In the top figure,
we cannot determine the loss rates of the two physical links separately from one
path measurement. Therefore we combine the two links together to form one MILS.
In the middle figure, three independent paths traverse three links. Thus each link is
identifiable, and is an MILS. In the bottom figure, there are five links and four paths.
Each path is an MILS, since no path can be written as a sum of shorter MILSes. But
link 3 can be represented as (2′ + 3′ − 1′ − 4′)/2, which means link 3 is identifiable,
and there are five MILSes. These examples show three features of the MILS set:
The MILSes may be linearly dependent, as in the bottom example. We can shrink
our MILS set to a basis for the path space by removing such linear dependence,
e.g., by removing the MILS c in the bottom example in Figure 14.5. But it is
helpful to keep such links for diagnosis.
Some MILSes may contain other MILSes. For instance, MILS e is contained in
MILSes b and c in the bottom example.
[Figure 14.5: MILS examples for undirected graphs. (top) one path over links 1 and 2, G = [1 1], rank(G) = 1; (middle) three paths over three links, G = [[1 1 0], [1 0 1], [0 1 1]], rank(G) = 3; (bottom) four paths 1′–4′ over five links, G = [[1 1 0 0 0], [0 1 1 0 1], [1 0 1 1 0], [0 0 0 1 1]], rank(G) = 4, and the resulting MILSes a–e.]
As we have defined them, MILSes satisfy two properties: they are minimal, i.e.
they cannot be decomposed into shorter MILSes; and they are identifiable, i.e. they
can be expressed as linear combinations of end-to-end paths. Algorithm 1 finds
all possible MILSes by exhaustively enumerating the link sequences and checking
each for minimality and identifiability. An identifiable link sequence on a path will
be minimal if and only if it does not share an endpoint with a MILS on the same
path. Thus as we enumerate the link sequences on a given path in increasing order of
size, we can track whether each link is the starting link in some already-discovered
MILS, which allows us to check for minimality in constant time. To test whether a
link sequence is identifiable, we need only to make sure that the corresponding path
vector v lies in the path space. Since the columns of Q form an orthonormal basis for the path space, v
will lie in the path space if and only if ‖v‖ = ‖Q^T v‖.
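This identifiability test can be sketched as follows (a hypothetical illustration using the two-path example of Figure 14.6, with Q obtained from a QR decomposition of Ḡ^T):

import numpy as np

G_bar = np.array([[1, 1, 0],
                  [1, 0, 1]], dtype=float)    # measured basis paths (Figure 14.6)
Q, _ = np.linalg.qr(G_bar.T)                  # columns of Q: orthonormal basis
                                              # of the row (path) space

def identifiable(v, Q, tol=1e-9):
    # v lies in the path space iff projecting it onto that space loses nothing,
    # i.e. ||v|| == ||Q^T v||.
    v = np.asarray(v, dtype=float)
    return abs(np.linalg.norm(v) - np.linalg.norm(Q.T @ v)) < tol

print(identifiable([1, 1, 0], Q))   # True: a measured path is identifiable
print(identifiable([1, 0, 0], Q))   # False: link 1 alone is not identifiable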
Now we analyze the computational complexity of identifying MILSes. If a link
sequence contains i links, then v will contain only i non-zeros, and it will cost
O(i·k) time to compute ‖Q^T v‖. This cost dominates the cost of checking for
minimality, and so the overall cost to check whether one link subsequence is an
MILS will be at worst O(i·k). On a path of length l, there are O(l²) link subsequences,
each of which costs at most O(l·k) time to check, so the total time to find
all the MILSes on one end-to-end path is at most O(k·l³). However, we can further
reduce the complexity from O(k·l³) to O(k·l²) using dynamic programming. If
we check every end-to-end path in the network, the overall complexity of Algorithm
1 will then be O(r·k·l²). However, our simulations and Internet experiments
show that only a few more MILSes are obtained from scanning all r end-to-end
paths than from scanning only the k end-to-end paths which are directly monitored.
Furthermore, each physical link used by the network is used by one of the k
monitored paths, so the MILSes obtained from this smaller set of paths do cover every
physical link. Therefore, in practice, we scan only the k monitored paths, which
costs O(k²·l²) time, and we accept a slight loss of diagnosis granularity.
Once we have identified all the MILSes, we need to compute their loss rates. We
do this by finding a solution to the underdetermined linear system Ḡx_G = b̄ (see
[9]). For example, in Figure 14.6, x_G = ((2x1 + x2 + x3)/3, (x1 + 2x2 − x3)/3, (x1 − x2 + 2x3)/3)^T.
Obviously, x_G contains identifiable vectors in R(G^T); however, they may not be
MILSes. Then for each MILS with vector v, the loss rate is obtained from v^T x_G. The elements of
[Figure 14.6: a sample topology with two measured paths p1 (links 1, 2) and p2 (links 1, 3), G = [[1 1 0], [1 0 1]]; the row (path) space (identifiable) is spanned by (1, 1, 0) and (1, 0, 1), the null-space direction is (−1, 1, 1), and Q is the corresponding orthonormal basis.]
x_G need not be the real link loss rates: only the inner products v^T x_G are guaranteed
to be unique and to correspond to real losses. We also note that because loss rates
in the Internet remain stable over time scales on the order of an hour [32], the path
measurements in b̄ need not be taken simultaneously.
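A minimal sketch of this computation, under assumed path loss measurements for the example of Figure 14.6, is shown below; the minimum-norm solution is obtained with a pseudo-inverse, and the loss rate of a MILS with vector v is recovered as 1 − exp(v^T x_G).

import numpy as np

G_bar = np.array([[1, 1, 0],
                  [1, 0, 1]], dtype=float)    # measured paths p1, p2 (Figure 14.6)
p = np.array([0.03, 0.05])                    # assumed measured path loss rates
b_bar = np.log(1.0 - p)

# Minimum-norm solution: the projection x_G of the true x onto the row space.
x_G = np.linalg.pinv(G_bar) @ b_bar

def mils_loss_rate(v, x_G):
    # Loss rate of a MILS with link-sequence vector v.
    return 1.0 - np.exp(np.dot(v, x_G))

print(mils_loss_rate(np.array([1, 1, 0]), x_G))   # reproduces p1 = 0.03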
[Figure 14.7: a directed graph in which end hosts A, B, and C communicate through a central node; each of the six paths consists of one of the three incoming links (1–3) and one of the three outgoing links (4–6) of the central node, giving a 6 × 6 matrix G with rank(G) = 5.]
x^T ū = z^T G ū = z^T G w̄ = x^T w̄.

Therefore, if the link sequence includes an incoming link of node N, it must also
include an outgoing link. Thus, no identifiable link sequence may have an endpoint
at an interior network node. This means that the only identifiable link sequences are
loops and end-to-end paths. □
Routing loops are rare in the Internet; thus, given Theorem 1, each path is an
MILS and there are no others. This means that there are no individual links or sub-
paths whose loss rates can be exactly determined from end-to-end measurements.
Next, we discuss some practical methods to obtain finer-level unbiased inference
on directed graphs, such as the Internet.
Consider the simple directed graph in Figure 14.7: the problem of determining
link loss rates is similar to breaking a deadlock. If any of the individual links can
somehow be measured, then the loss rates of all other links can be determined through
end-to-end measurements. Since link loss rates cannot be negative, for a path with
zero loss rate, all the links on that path must also have zero loss rates. This breaks
the deadlock and helps solve for the link loss rates on other paths.
We call this inference approach the good path algorithm. Note that this is a fact in-
stead of an extra assumption. Our PlanetLab experiments as well as [32] show that
more than 50% of paths in the Internet have no loss.
In addition, we can relax the definition of “good path” and allow a negligible
loss rate of at most ε (e.g., ε = 0.5%, which is the threshold for “no loss” in [32]).
Then again it becomes a trade-off between accuracy and diagnosis granularity, as
depicted in our framework. Note that although the strict good path algorithm cannot
be applied to other metrics such as latency, such bounded inference is generally
applicable.
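The following sketch (our own simplification of the good path idea, with a hypothetical three-path topology and an ε threshold) marks the links of near-lossless paths as good and returns the reduced matrix G′ over the remaining lossy paths and links.

import numpy as np

def good_path_reduction(G, path_loss, eps=0.005):
    # Paths with loss rate at most eps are "good"; every link they cover is good.
    good_paths = path_loss <= eps
    good_links = G[good_paths].sum(axis=0) > 0
    keep_paths, keep_links = ~good_paths, ~good_links
    G_prime = G[np.ix_(keep_paths, keep_links)]
    return G_prime, np.where(good_links)[0], np.where(keep_paths)[0]

G = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
path_loss = np.array([0.0, 0.04, 0.02])       # assumed measured path loss rates
G_prime, good_links, lossy_paths = good_path_reduction(G, path_loss)
print(good_links, lossy_paths)                 # links 0,1 are good; paths 1,2 remain
print(G_prime)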
As illustrated in the second stage of Figure 14.8, there are two steps for identifying
MILSes in directed graphs. First, we find all the good paths in G and thus
establish some good links. We remove these good links and good paths from G to
get a submatrix G′. Then we apply Algorithm 8 to G′ to find all lossy MILSes and
their loss rates in G. For good links that lie in the middle of identified lossy MILSes,
we add them back so that the MILSes are contiguous. In addition, we apply
the following optimization procedure to obtain Q quickly for the identifiability test
(step 10 of Algorithm 8).
[Figure 14.8: the processing pipeline. Measure the topology to get G; select a basis Ḡ of G for monitoring; estimate loss rates for all paths in G; apply the good path algorithm to G and Ḡ to obtain the reduced matrices G′ and G″; select a basis of G″ via G″ = QR; obtain all MILSes and their loss rates.]
We remove all the good links from Ḡ and get a submatrix G″ that is smaller than G′.
By necessity, G″ contains a basis of G′. We can then use the small matrix G″ to
do the QR decomposition and thus get Q′. Since G″ is usually quite small even when
G comes from a reasonably large overlay network, this optimization makes the
LEND system very efficient for online diagnosis. In Figure 14.9, we use a simple topology
to show the matrices computed in the whole process. The path from C to B is a
good path, and thus links 2 and 6 are good links.
14.6 Evaluations
In this section, we mainly describe the scalability of the monitoring path selection algorithm
in TOM (e.g., the study of the rank k) and some real Internet experiments with TOM
and LEND on PlanetLab. More extensive studies of accuracy, load balancing,
and other aspects, in both simulation and real experiments, can be found in [9, 34].
[Figure 14.9: the matrices computed for a simple directed topology with end hosts A, B, and C: the original 6 × 6 matrix G over links 1–6; the reduced matrix G′ over links 1, 3, 4, and 5 after the good path from C to B and the good links 2 and 6 are removed; and the submatrix G″ of the measured basis Ḡ, which contains a basis of G′.]
[Figure: rank k of the path space versus the number of end hosts n, with regression curves for n, n log n, n^1.25, n^1.5, and n^1.75: (a) a hierarchical model of 20K nodes; (b) a real topology of 284,805 routers.]
For α between 1.1 and 2.0, the x-intercept varies between 0.728 and 0.800. That is, the
fitted function intersects the data about 3/4 of the way across the domain for a wide
range of exponents (including the exponents 1.25, 1.5, and 1.75).
Thus, conservatively speaking, we have k = O(n log n).
Note that this trend still holds when the end hosts are sparsely distributed in the
Internet, e.g., when each end host is in a different access network. One extreme case
is the “star” topology: each end host is connected to the same center router via its
own access network. In such a topology, there are only n links, so k = O(n).
Only topologies with very dense connectivity, like a full clique, have k = O(n²);
such topologies have little link sharing among the end-to-end paths.
The key observation is that when n is sufficiently large, such dense connectivity
is very unlikely to exist in the Internet because of the power-law degree distribution.
Tangmunarunkit et al. found that link usage, as measured by the set of node pairs
(source–destination pairs) whose traffic traverses the link, also follows a power-law
distribution, i.e., there is a very small number of links that are on the shortest paths
of the majority of node pairs. So there is a significant amount of link sharing among
the paths, especially for backbone links, customer links, and peering links.
We deployed and evaluated our TOM and LEND systems on 135 PlanetLab hosts
around the world [26]. Each host is from a different institution. About 60% of the hosts are in
the USA, and the others are distributed mostly in Europe and Asia. There are altogether
135 × 134 = 18,090 end-to-end paths among these end hosts. In our experiments,
we measured all the paths for validation. But, in practice, we only need to measure
the basis set of, on average, 5,706 end-to-end paths, determined by the TOM system.
The measurement load can be evenly distributed among the paths with the technique
in [9] so that each host only needs to measure about 42 paths.
In April 2005, we ran the experiments 10 times, at different times of night and
day. Below we report the average results from the 10 experiments (see Table 14.2).
Of the total of 135 × 134 = 18,090 end-to-end paths, after removing about 65.5% of the
paths as good paths, containing about 70.5% of the links as good links, only 6,450 paths remain.
The average length of lossy MILSes on bad paths is 3.9 links, or 2.3 virtual links.
[Figure 14.11: the distribution of the length (in physical links) of lossy MILSes measured at 3PM, 6PM, 9PM, and 1AM.]
The diagnosis granularity for lossy paths is somewhat coarse: 3.8 physical links. But we believe it
is reasonable and acceptable for the following two reasons. First, in edge networks,
paths usually have a long link chain without branches. For example, all
paths starting from planetlab1.cs.northwestern.edu go through the same five first
hops. If we use the virtual link as the unit, we find the granularity is reduced to about
2.3 virtual links. This shows our LEND approach can achieve diagnosis granularity
comparable to that of other, more biased tomography approaches, while achieving
high accuracy.
Second, we find that there exist some very long lossy MILSes as illustrated in
Figure 14.11, which shows the distribution of the length in physical links of lossy
MILSes measured in different time periods of a day (US Central Standard Time).
For example, some MILSes are longer than 10 hops. Such long lossy MILSes occur
in relatively small overlay networks because some paths do not overlap any other
paths.
We can further apply the Gibbs sampling approach [25] based on the MILSes found
and obtain a lower bound on the diagnosis granularity [34], which is 1.9 physical
links, and obviously one hop with respect to virtual links. However, accuracy will
be sacrificed to some extent, as shown in [34]. Nevertheless, by combining
statistical approaches with our LEND system, we provide full flexibility to trade
off between granularity and accuracy.
14.7 Conclusions
In this chapter, we design, implement, and evaluate algebraic approaches for adap-
tive scalable overlay network monitoring (TOM) and diagnosis (LEND).
For an overlay of n end hosts, we selectively monitor a basis set of O(n log n)
paths which can fully describe all the O(n²) paths. Then the measurements of the
basis set are used to infer the loss rates of all other paths. Our approach works in
real time, offers fast adaptation to topology changes, distributes balanced load to
end hosts, and handles topology measurement errors. Both simulation and Internet
implementation yield promising results.
We also advocate the unbiased end-to-end network diagnosis paradigm, which
gives a smooth tradeoff between accuracy and diagnosis granularity when combined
with various statistical assumptions. We introduce the concept of the minimal identifiable
link sequence and propose the good path algorithms, which leverage measurement
snapshots to effectively diagnose directed graphs. Both simulation and PlanetLab
experiments show that we can achieve fine-grained diagnosis with high accuracy
in near real time.
References
1. Adams, A., et al.: The use of end-to-end multicast measurements for characterizing internal
network behavior. In: IEEE Communications (May, 2000)
2. Akamai Inc.: Technology overview. http://www.akamai.com/en/html/technology/overview.html
3. Anagnostakis, K., Greenwald, M., Ryger, R.: cing: Measuring network-internal delays using
only existing infrastructure. In: IEEE INFOCOM (2003)
4. Anderson, E., et al.: LAPACK Users’ Guide, third edn. Society for Industrial and Applied
Mathematics, Philadelphia, PA (1999)
5. Bu, T., Duffield, N., Presti, F., Towsley, D.: Network tomography on general topologies. In:
ACM SIGMETRICS (2002)
6. Caceres, R., Duffield, N., Horowitz, J., Towsley, D.: Multicast-based inference of network-internal loss characteristics. IEEE Transactions on Information Theory 45 (1999)
7. Caceres, R., Duffield, N., Horowitz, J., Towsley, D., Bu, T.: Multicast-based inference of
network-internal characteristics: Accuracy of packet loss estimation. In: IEEE INFOCOM
(1999)
8. Chen, Y., Bindel, D., Katz, R.H.: Tomography-based overlay network monitoring. In: ACM
SIGCOMM Internet Measurement Conference (IMC) (2003)
9. Chen, Y., Bindel, D., Song, H., Katz, R.H.: An algebraic approach to practical and scalable
overlay network monitoring. In: ACM SIGCOMM (2004)
10. Chen, Y., Lim, K., Overton, C., Katz, R.H.: On the stability of network distance estimation. In:
ACM SIGMETRICS Performance Evaluation Review (PER) (Sep. 2002)
11. Chua, D.B., Kolaczyk, E.D., Crovella, M.: Efficient monitoring of end-to-end network proper-
ties. In: IEEE INFOCOM (2005)
12. Coates, M., Hero, A., Nowak, R., Yu, B.: Internet Tomography. IEEE Signal Processing Mag-
azine 19(3), 47–65 (2002)
13. Demmel, J.: Applied Numerical Linear Algebra. SIAM (1997)
14. Duffield, N.: Simple network performance tomography. In: ACM SIGCOMM Internet Mea-
surement Conference (IMC) (2003)
15. Duffield, N., Horowitz, J., Towsley, D., Wei, W., Friedman, T.: Multicast-based loss inference
with missing data. IEEE Journal of Selected Areas of Communications 20(4) (2002)
16. Duffield, N., Presti, F., Paxson, V., Towsley, D.: Inferring link loss using striped unicast probes.
In: IEEE INFOCOM (2001)
17. Floyd, S., Jacobson, V.: Random early detection gateways for congestion avoidance.
IEEE/ACM Transactions on Networking 1(4) (1993)
18. Francis, P., et al.: IDMaps: A global Internet host distance estimation service. IEEE/ACM
Trans. on Networking (2001)
19. Golub, G., Loan, C.V.: Matrix Computations. The Johns Hopkins University Press (1989)
20. Govindan, R., Tangmunarunkit, H.: Heuristics for Internet map discovery. In: IEEE INFOCOM
(2000)
21. Mahajan, R., Spring, N., Wetherall, D., Anderson, T.: User-level internet path diagnosis. In:
ACM SOSP (2003)
22. Medina, A., Matta, I., Byers, J.: On the origin of power laws in Internet topologies. In: ACM
Computer Communication Review (2000)
23. Ng, T.S.E., Zhang, H.: Predicting Internet network distance with coordinates-based ap-
proaches. In: Proc.of IEEE INFOCOM (2002)
24. Ozmutlu, H.C., et al.: Managing end-to-end network performance via optimized monitoring
strategies. Journal of Network and System Management 10(1) (2002)
25. Padmanabhan, V., Qiu, L., Wang, H.: Server-based inference of Internet link lossiness. In:
IEEE INFOCOM (2003)
26. PlanetLab: http://www.planet-lab.org/
27. Brualdi, R.A., Pothen, A., Friedland, S.: The sparse basis problem and multilinear algebra. SIAM Journal on Matrix Analysis and Applications 16, 1–20 (1995)
28. Ratnasamy, S., et al.: Topologically-aware overlay construction and server selection. In: Proc.
of IEEE INFOCOM (2002)
29. Shavitt, Y., Sun, X., Wool, A., Yener, B.: Computing the unmeasured: An algebraic approach
to Internet mapping. In: IEEE INFOCOM (2001)
30. Song, H., Qiu, L., Zhang, Y.: NetQuest: A flexible framework for large-scale network measurement. In: ACM SIGMETRICS (June 2006)
31. Tang, C., McKinley, P.: On the cost-quality tradeoff in topology-aware overlay path probing.
In: IEEE ICNP (2003)
32. Zhang, Y., et al.: On the constancy of Internet path properties. In: Proc. of SIGCOMM IMW
(2001)
33. Zhao, Y., Chen, Y.: A suite of schemes for user-level network diagnosis without infrastructure.
In: IEEE INFOCOM (2007)
34. Zhao, Y., Chen, Y., Bindel, D.: Towards unbiased end-to-end network diagnosis. In: ACM
SIGCOMM (2006)
Part III
Emerging Applications
Chapter 15
Network Coding and Its Applications
in Communication Networks
Alex Sprintson
15.1 Introduction
15.1.1 Motivation
A. Sprintson ()
Texas A&M University, College Station, TX 77843, USA
e-mail: [email protected]
Fig. 15.1 Basic network coding example: (a) Original network; (b) A multicast tree with a root at node s1; (c) A multicast tree with a root at node s2; (d) A feasible network coding scheme
To demonstrate the advantage of the network coding technique, consider the net-
work depicted in Figure 15.1(a). The network includes two information sources, s1
and s2 , and two terminals, t1 and t2 . We assume that all edges of the network are
of unit capacity, i.e., each edge can transmit one packet per time unit. With the tra-
ditional approach, the packets are forwarded over two Steiner trees,1 such that the
first tree forwards the packets generated by source s1 , while the second tree for-
wards packets generated by node s2 . However, the network does not contain two
edge-disjoint Steiner trees with roots in s1 and s2 ; hence the multicast connection
with two information sources cannot be implemented using traditional methods.
For example, the trees depicted in Figures 15.1(b) and (c) share the bottleneck edge
(v1, v2). Figure 15.1(d) shows that this conflict can be resolved by employing the
network coding technique. To demonstrate this approach, let a and b be the packets
generated by the information sources s1 and s2, respectively, at the current communication
round. Both packets are sent to the intermediate node v1, which generates
a new packet a ⊕ b, which is then sent to both nodes t1 and t2. It is easy to verify
that both terminal nodes can decode the packets a and b from the packets received
over their incoming edges.
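The decoding in Figure 15.1(d) amounts to a single XOR at each terminal; the following minimal sketch (with packets modeled as small integers, an illustration rather than a protocol implementation) verifies it.

# Packets generated by sources s1 and s2 (assumed example values).
a, b = 0b1011, 0b0110

coded = a ^ b                  # packet a XOR b sent over the bottleneck (v1, v2)

# Terminal t1 receives a directly and the coded packet; t2 receives b directly.
b_at_t1 = coded ^ a
a_at_t2 = coded ^ b

assert b_at_t1 == b and a_at_t2 == a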
The network coding technique can also be useful for minimizing the delay of
data delivery from the source to the terminal nodes [9]. For example, consider the
network depicted in Figure 15.2(a). Suppose that each edge can transmit one packet
per time unit and that the delay of each edge is also one time unit. Figures 15.2(b)
and (c) show two edge-disjoint Steiner trees that connect s to the terminals t1 , t2 ,
and t3 . However, one of the trees is of depth three, and, as a result, terminal t2 will
receive one of the packets after a delay of three time units. It can be verified that
any scheme that does not employ network coding results in a delay of three time
units. Figure 15.2(d) shows a network coding solution which delivers the data with
the delay of just two time units.
1
A Steiner tree is a tree that connects the source node with the terminals and may include any
number of other nodes.
Fig. 15.2 Delay minimization with network coding: (a) Original network; (b) First Steiner tree
with a root at node s; (c) Second Steiner tree with a root at node s; (d) A network coding scheme
Fig. 15.3 Reducing energy consumption with network coding: (a) traditional approach and
(b) network coding approach
The network coding technique can also be employed to minimize the number of
transmissions in wireless networks [36]. For example, consider the wireless network
depicted in Figure 15.3. The network contains two nodes s1 and s2 that want to
exchange packets through an intermediate relay node v. More specifically, node s1
needs to send packet a to s2 and node s2 needs to send packet b to s1 . Figure 15.3(a)
shows a traditional routing scheme that requires four transmissions. Figure 15.3(b)
shows a network coding scheme in which the intermediate node v first obtains two
packets a and b from s1 and s2 and then generates a new packet a ⊕ b and broadcasts
it to both s1 and s2 . This scheme requires only three transmissions. The example
shows that the network coding technique can take advantage of the broadcast nature
of the wireless medium to minimize the number of transmissions.
As demonstrated by the above examples, network coding has several benefits for
a broad range of applications in both wired and wireless communications networks.
The goal of this chapter is to describe the network coding fundamentals as well as
to show a broad range of applications of this technique.
Network coding research was initiated by a seminal paper by Ahlswede et al. [1]
and has since then attracted significant interest from the research community. Many
initial works on the network coding technique focused on establishing multicast
connections. It was shown in [1, 28] that the capacity of the network, i.e., the max-
imum number of packets that can be sent from the source s to a set T of terminals
per time unit, is equal to the minimum capacity of a cut that separates the source
s and a terminal t 2 T . In a subsequent work, Koetter and Médard [25] developed
an algebraic framework for network coding and investigated linear network codes
for directed graphs with cycles. This framework was used by Ho et al. [18] to show
that linear network codes can be efficiently constructed through a randomized al-
gorithm. Jaggi et al. [21] proposed a deterministic polynomial-time algorithm for
finding feasible network codes in multicast networks. Network coding for networks
with cycles has been studied in [4, 12]. Network coding algorithms resilient to ma-
licious interference have been studied in [20, 24, 35].
While there are efficient polynomial-time algorithms for network code construc-
tion in multicast settings, finding efficient network codes in non-multicast scenarios
is a more difficult problem [33]. The complexity of several general network coding
problems has been analyzed by Lehman and Lehman [27]. Dougherty et al. [11]
showed that linear network codes are insufficient for achieving the capacity of a
network with multiple unicast connections.
The applications of network coding in wired and wireless communication net-
works have been the subject of several recent studies. Chou and Wu [9] discussed
implementation of network coding in content distribution networks. They discussed
several issues such as synchronization, varying delay, and traffic loss. The advan-
tages of network coding in large-scale peer-to-peer content distribution systems have
been studied in [14, 15]. Network coding techniques for improving the performance
of wireless networks have been studied in [23], [6], and [22].
Comprehensive surveys on the network coding techniques are available in some
recent books [13, 19, 37].
In this section we describe the network model and present the basic definitions of
the network coding technique. Then, we present an algebraic framework for multi-
cast connections. Finally, we present deterministic and randomized algorithms for
construction of efficient network codes.
Let h(i) denote the maximum number of packets that can be delivered to all terminals in T over i communication rounds. The capacity h* of the coding network is then defined as

h* = lim sup_{i→∞} h(i)/i.  (15.1)

For example, the network depicted in Figure 15.1(a) can deliver two packets per
time unit to each terminal; hence its capacity is equal to 2. Indeed, the network
coding scheme depicted in Figure 15.1(d) can deliver two new packets at every
communication round. For this network, it holds that h(1) = h*, which, in turn,
implies that h* = h(i)/i for every i ≥ 1. The last property holds for any acyclic communication
network, but it does not necessarily hold for a network that contains cycles. To see
this, consider the network N(G, s, T) depicted in Figure 15.4(a). For this network, it
is easy to verify that one round of communication is insufficient for delivering two
packets to both terminals t1 and t2. Figure 15.4(b) shows a network coding scheme
Fig. 15.4 A coding network with cycles: (a) the original network; (b) a feasible network coding scheme in which node v3 sends x_i = a_i for i = 1 and x_i = a_i ⊕ y_{i−1} for i > 1, and node v4 sends y_i = b_i for i = 1 and y_i = b_i ⊕ x_{i−1} for i > 1
2
For a definition of finite field see, e.g., [31].
that can transmit 2n packets over n + 1 rounds; hence the capacity of the network is
equal to 2. In particular, at the first round, node v3 forwards the packet a1 received
over its incoming edge (v1, v3), i.e., x1 = a1. Then, for each round i, i > 1, node
v3 generates a new packet by computing the bitwise XOR of a_i and y_{i−1}. Node
v4 also generates a new packet by computing the bitwise XOR of b_i and x_{i−1}. It
is easy to verify that the destination nodes t1 and t2 can decode the packets sent by
the source node after a delay of one round.
In this section, we present a formal definition of a linear network code. For clarity,
we assume that the underlying network graph G(V, E) is acyclic. As discussed
above, such networks are easier to analyze, because we only need to consider a single
communication round. We also assume that exactly one packet is sent over each
edge and that each node must receive all packets from its incoming edges before
sending a packet on its outgoing edges.
Suppose that we would like to transmit h packets R = (p1, p2, ..., ph) over
the multicast network N(G, s, T). We assume that the source node s has exactly h
incoming edges, indexed by e1, e2, ..., eh, and that each terminal t ∈ T has h incoming
edges and no outgoing edges. Note that these assumptions can be made without loss
of generality. Indeed, suppose that the second assumption does not hold for some
terminal t ∈ T. In this case, we can add a new terminal t′, connected with t by h
parallel edges, resulting in an equivalent network. Figure 15.5(a) depicts an example
of a network that satisfies these assumptions. For each edge e ∈ E we denote by p_e the
packet transmitted on that edge. Each incoming edge e_i, 1 ≤ i ≤ h, of the source
node s transmits the original packet p_i.
Let e(v, u) ∈ E be an edge of the coding network N(G, s, T) and let M_e be the
set of incoming edges in G of its tail node v, M_e = {(w, v) | (w, v) ∈ E}. Then, we
associate with each edge e′ ∈ M_e a local encoding coefficient β_{e′,e} ∈ F_q = GF(q).
The local encoding coefficients of the edges that belong to M_e determine the packet
p_e transmitted on edge e as a function of the packets transmitted on the incoming edges
M_e of e. Specifically, the packet p_e is equal to

p_e = Σ_{e′ ∈ M_e} β_{e′,e} · p_{e′}.  (15.2)
Definition 15.1 (Linear Network Code). Let N(G, s, T) be a coding network and
let F_q = GF(q) be a finite field. Then, the assignment of local encoding coefficients
{β_{e′,e} ∈ F_q | e ∈ E, e′ ∈ M_e} is referred to as a linear network code for N(G, s, T).
Fig. 15.5 Encoding notation: (a) Original network; (b) Local encoding coefficients; (c) Encoding
of transmitted packets
Figure 15.5(b) demonstrates the local encoding coefficients that form a linear
network code for the coding network depicted in Figure 15.5(a).
Our goal is to find a set of network coding coefficients {β_{e′,e}} that allows each
terminal to decode the original packets R from the packets obtained through its
incoming edges. An assignment of {β_{e′,e}} that satisfies this condition is referred
to as a feasible network code for N(G, s, T). For example, consider the network
depicted in Figure 15.5(a) and suppose that all operations are performed over the field
F_2 = GF(2). Then, the assignment of encoding coefficients β_{e1,e3} = β_{e2,e4} =
β_{e6,e9} = β_{e7,e9} = 1 and β_{e1,e4} = β_{e2,e3} = 0 results in a feasible network code.
The packets transmitted by the edges of the network are shown in Figure 15.5(c).
Note that each packet transmitted over the network is a linear combination of the
original packets R = {p1, p2, ..., ph} generated by the source node s. Accordingly,
for each edge e ∈ E we define the global encoding vector γ_e = [γ_e^1 ... γ_e^h] ∈
F_q^h, which captures the relation between the packet p_e transmitted on edge e and the
original packets in R:

p_e = Σ_{i=1}^{h} γ_e^i · p_i.  (15.3)
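As a small illustration of how (15.2) induces the global encoding vectors of (15.3), the sketch below propagates the vectors over GF(2) for a hypothetical edge e3 whose parents are the two source edges; the edge names and coefficient values are assumptions made for the example.

import numpy as np

gamma = {"e1": np.array([1, 0]), "e2": np.array([0, 1])}   # source edges carry p1, p2
beta = {("e1", "e3"): 1, ("e2", "e3"): 1}                  # assumed local coefficients

# Global vector of e3: the beta-weighted sum of its parents' vectors (mod 2).
gamma["e3"] = (beta[("e1", "e3")] * gamma["e1"]
               + beta[("e2", "e3")] * gamma["e2"]) % 2
print(gamma["e3"])    # [1 1]: e3 carries p1 XOR p2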
In this section, we present the algorithmic framework due to [25] for linear network
coding in acyclic multicast networks and establish its connection to the min-cut-max-flow
theorem.
Let N(G, s, T) be a coding network and let t be one of the terminals in T. We
denote by E_t = {e_t^1, ..., e_t^h} the set of incoming edges of terminal t. We define
the h × h matrix M_t as follows:

M_t = [ γ_{e_t^1} ; γ_{e_t^2} ; ... ; γ_{e_t^h} ].  (15.7)

That is, each row of M_t contains the global encoding vector of one of the incoming
edges e_t^i of t. We refer to M_t as the transfer matrix. The transfer matrix
captures the relation between the original packets R and the packets received by the
terminal node t ∈ T over its incoming edges. For example, for the network depicted
in Figure 15.5(a), the transfer matrix M_{t1} for the terminal t1 is equal to

M_{t1} = [ β_{e1,e3}                                      β_{e2,e3}
           β_{e1,e3}·β_{e6,e9} + β_{e1,e4}·β_{e7,e9}      β_{e2,e3}·β_{e6,e9} + β_{e2,e4}·β_{e7,e9} ].  (15.8)
Terminal t can decode the original packets in R if and only if the transfer matrix
M_t is of full rank or, equivalently, the determinant det(M_t) is not zero. Thus, the
purpose of the network coding scheme is to find an assignment of the coefficients
{β_{e′,e}} that results in a full-rank transfer matrix M_t for each terminal t ∈ T.
For example, for the network depicted in Figure 15.5(a), the determinant of the
transfer matrix M_{t1} is equal to

det(M_{t1}) = β_{e1,e3}·(β_{e2,e3}·β_{e6,e9} + β_{e2,e4}·β_{e7,e9}) − β_{e2,e3}·(β_{e1,e3}·β_{e6,e9} + β_{e1,e4}·β_{e7,e9}).  (15.10)

Similarly,

det(M_{t2}) = β_{e2,e4}·(β_{e1,e3}·β_{e6,e9} + β_{e1,e4}·β_{e7,e9}) − β_{e1,e4}·(β_{e2,e3}·β_{e6,e9} + β_{e2,e4}·β_{e7,e9}).  (15.11)

It is easy to verify that the assignment β_{e1,e3} = β_{e2,e4} = β_{e6,e9} = β_{e7,e9} = 1 and
β_{e1,e4} = β_{e2,e3} = 0 results in non-zero values of the determinants of both matrices,
det(M_{t1}) and det(M_{t2}).
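This feasibility check can be reproduced numerically; the sketch below evaluates the determinants (15.10) and (15.11) over GF(2) for the assignment chosen in the text, with M_{t2} assembled by analogy with (15.8) so that its determinant matches (15.11).

import numpy as np

b13, b24, b69, b79 = 1, 1, 1, 1            # assignment chosen in the text
b14, b23 = 0, 0

M_t1 = np.array([[b13,               b23],
                 [b13*b69 + b14*b79, b23*b69 + b24*b79]])
M_t2 = np.array([[b14,               b24],
                 [b13*b69 + b14*b79, b23*b69 + b24*b79]])

def det2_mod2(M):
    # Determinant of a 2x2 matrix, reduced over GF(2).
    return (M[0, 0]*M[1, 1] - M[0, 1]*M[1, 0]) % 2

# Each terminal can decode iff its transfer matrix is non-singular over GF(2).
print(det2_mod2(M_t1), det2_mod2(M_t2))    # 1 1 -> the code is feasible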
We observe that the determinant det(M_t) of the transfer matrix M_t is a multivariate
polynomial in the variables {β_{e′,e}}. Let P = ∏_{t∈T} det(M_t) be the product of
the determinants of the transfer matrices for all terminals t ∈ T. Clearly, if P is
identically equal to 0, then there is no feasible linear network code for N(G, s, T).
However, it turns out that if P is not identically equal to 0, then it is possible to find
a feasible assignment of coefficients {β_{e′,e}}, provided that the field F_q is sufficiently
large. Specifically, the size q of F_q must be larger than the maximum degree of P
with respect to any variable β_{e′,e}.
Figure 15.6 presents a procedure, referred to as Procedure FindSolution,
that finds a non-zero solution for a multivariate polynomial P. The procedure
receives, as input, a non-zero polynomial P(x1, x2, ..., xn) and a finite field
F_q = GF(q). The procedure iteratively finds assignments x_i = a_i such that
P(a1, a2, ..., an) ≠ 0. At iteration i, the procedure considers a polynomial P_i obtained
from P by substituting x_j = a_j for 1 ≤ j ≤ i − 1. Then, we consider P_i
to be a multivariate polynomial in x_{i+1}, ..., x_n whose coefficients are (univariate)
polynomials in x_i. Next, we pick a monomial P′ of P_i and consider its coefficient
1  P1(x1, x2, ..., xn) ← P(x1, x2, ..., xn)
2  for each i = 1 to n do
3      Consider P_i to be a multivariate polynomial in x_{i+1}, ..., x_n whose
       coefficients are univariate polynomials in F_q[x_i]
4      Select a monomial P′ of P_i which is not identically equal to 0
5      Let P(x_i) denote the coefficient of P′
6      Choose a_i ∈ F_q such that P(a_i) ≠ 0
7      Substitute x_i = a_i in P_i and denote the resulting polynomial by
       P_{i+1}(x_{i+1}, ..., x_n)
8  Return a1, a2, ..., an
P(x_i). Since the size q of the finite field is larger than the maximum degree of
the variable x_i in P(x_i), there exists a value a_i ∈ F_q such that P(a_i) is not zero.
Hence, both P′|_{x_i = a_i} and, in turn, P_i|_{x_i = a_i} are non-zero polynomials.
For example, suppose we would like to find a solution for the polynomial
P(x1, x2, x3) = x1·x2²·x3 + x1²·x2²·x3 + x1²·x2²·x3² over F_3 = GF(3). We consider
P1(x1, x2, x3) = P(x1, x2, x3) to be a polynomial in x2 and x3 whose coefficients
are polynomials in x1. Specifically, we write

P1(x1, x2, x3) = (x1 + x1²)·x2²·x3 + x1²·x2²·x3² = P′(x1)·x2²·x3 + P″(x1)·x2²·x3²,

where P′(x1) = x1 + x1² and P″(x1) = x1². Next, we select the monomial x2²·x3
with coefficient P′(x1) and find a1 ∈ F_q such that P′(a1) ≠ 0. Note that a1 = 1 is a good choice for
F_q = GF(3). Next, we set P2(x2, x3) = P1(x1, x2, x3)|_{x1 = 1} = 2·x2²·x3 + x2²·x3²
and proceed with the algorithm.
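A simplified variant of FindSolution can be sketched with SymPy as follows; instead of isolating a monomial coefficient, it simply tries the elements of GF(q) for each variable and keeps the first value under which the partially evaluated polynomial is still non-zero, which succeeds under the same q > d condition.

from sympy import symbols, Poly, GF

def find_solution(P_expr, variables, q):
    # Simplified FindSolution: for each variable in turn, try the elements of
    # GF(q) and keep the first value under which the partially evaluated
    # polynomial is still not identically zero modulo q.
    assignment = {}
    current = Poly(P_expr, *variables, domain=GF(q))
    for x in variables:
        for a in range(q):
            candidate = Poly(current.as_expr().subs(x, a),
                             *variables, domain=GF(q))
            if not candidate.is_zero:
                assignment[x] = a
                current = candidate
                break
    return assignment

x1, x2, x3 = symbols('x1 x2 x3')
P = x1*x2**2*x3 + x1**2*x2**2*x3 + x1**2*x2**2*x3**2
print(find_solution(P, [x1, x2, x3], 3))   # e.g. {x1: 1, x2: 1, x3: 2}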
The following lemma shows the correctness of Procedure FindSolution.

Lemma 15.1. Let P be a non-zero polynomial in variables x1, x2, ..., xn
over F_q = GF(q) and let d be the maximum degree of P with respect to any
variable. Let F_q be a finite field of size q such that q > d. Then, Procedure
FindSolution{P(x1, x2, ..., xn), F_q} returns a1, a2, ..., an ∈ F_q such
that P(a1, a2, ..., an) ≠ 0.

Proof. (Sketch) We only need to show that at each iteration i, 1 ≤ i ≤ n, there
exists a_i ∈ F_q such that P(a_i) ≠ 0. This follows from the fact that P(x_i) is
a polynomial of maximum degree d, and hence it has at most d roots. Since F_q
includes q > d elements, there must be at least one element a_i ∈ F_q that satisfies
P(a_i) ≠ 0. □
Theorems 15.1 and 15.2 show the relation between the algebraic properties of the
transfer matrices M_t, t ∈ T, the combinatorial properties of G(V, E), and the existence
of a feasible network code {β_{e′,e}}.
We begin with the analysis of unicast connections, i.e., the case in which T contains
a single terminal node.
Theorem 15.1. Let N(G, s, T) be a coding network, with T = {t}, and let h be the
number of packets that need to be delivered from s to t. Then, the following three
conditions are equivalent:
1. There exists a feasible network code for N(G, s, T) and h over GF(q) for some
finite value of q;
2. The determinant det(M_t) of the transfer matrix M_t is a (multivariate) polynomial
not identically equal to 0;
3. Every cut³ that separates s and t in G(V, E) includes at least h edges.

Proof. (Sketch) (1) → (2): Suppose that there exists a feasible network code {β_{e′,e}}
for N(G, s, T) and h over GF(q). This implies that det(M_t) is not zero for {β_{e′,e}},
which, in turn, implies that det(M_t) as a polynomial in {β_{e′,e}} is not identically
equal to 0.
(2) → (1): Lemma 15.1 implies that there exists a non-zero assignment of the
local encoding coefficients {β_{e′,e}} for N(G, s, T) over a sufficiently large field F_q.
This assignment constitutes a valid network code for N(G, s, T).
(1) → (3): Suppose that there exists a feasible network code {β_{e′,e}} for
N(G, s, T) and h over GF(q). By way of contradiction, assume that there
exists a cut C that separates the source s and terminal t and includes h′ < h
edges. Let γ_1, γ_2, ..., γ_{h′} be the global encoding vectors of the edges that
belong to C. Then, for each incoming edge e of t, the global encoding
vector of e is a linear combination of γ_1, γ_2, ..., γ_{h′}. This, in turn, implies that
the global encoding vectors that correspond to the incoming edges of t span a subspace
of F_q^h of dimension h′ or smaller. This implies that the rows of M_t are
linearly dependent and, in turn, that det(M_t) is identically equal to 0, resulting in a
contradiction.
(3) → (1): The min-cut-max-flow theorem implies that there exist h edge-disjoint
paths that connect s and t. Let {β_{e′,e}} be an assignment of the local encoding coefficients
such that β_{e′(v,u),e(u,w)} = 1 only if both e′(v, u) and e(u, w) belong to the
same path. It is easy to verify that this assignment constitutes a feasible network
code. □
The next theorem extends these results for multicast connections.
Theorem 15.2. Let N(G, s, T) be a multicast coding network and let h be the number
of packets that need to be delivered from s to all terminals in T. Then, the
following three conditions are equivalent:
3
A cut in a graph G(V, E) is a partition of the nodes of V into two subsets V1 and V \ V1. We say
that a cut C = (V1, V \ V1) separates nodes s and t if s ∈ V1 and t ∈ V \ V1.
1. There exists a feasible network code for N(G, s, T) and h over GF(q) for some
finite value of q.
2. The product ∏_{t∈T} det(M_t) of the determinants of the transfer matrices is a (multivariate)
polynomial which is not identically equal to 0.
3. Every cut that separates s and t ∈ T in G(V, E) includes at least h edges.

Proof. (Sketch) (1) → (2): Similar to the case of unicast connections, the existence
of a feasible network code {β_{e′,e}} for N(G, s, T) and h over GF(q) implies that the
polynomial det(M_t) is not identically equal to 0 for each t ∈ T.
(2) → (1): Lemma 15.1 implies that there exists a non-zero assignment of the
local encoding coefficients {β_{e′,e}} for N(G, s, T) over a sufficiently large field F_q.
Since this assignment satisfies det(M_t) ≠ 0 for each t ∈ T, {β_{e′,e}} is a feasible
network code for N(G, s, T).
(1) → (3): Note that a feasible network code for the multicast connection
N(G, s, T) is also feasible for each unicast connection N(G, s, {t}), t ∈ T. Then, we
can use the same argument as in Theorem 15.1 to show that every cut that separates
s and t includes at least h edges.
(3) → (2): The min-cut-max-flow theorem implies that for each t ∈ T there
exist h edge-disjoint paths that connect s and t. An argument similar to that used
in Theorem 15.1 implies that for each t ∈ T the polynomial det(M_t) is not identically
equal to 0. This, in turn, implies that ∏_{t∈T} det(M_t) is also not identically
equal to 0. □
Theorem 15.2 implies that the capacity of a multicast coding network is equal
to the minimum size of a cut that separates the source s and a terminal t ∈ T. Algorithm
NetCode1, depicted in Figure 15.7, summarizes the steps required for finding
a feasible network code for a multicast network.
One of the most important parameters of a network coding scheme is the minimum required size of the finite field. The field size determines the number of available linear combinations. The number of such combinations, and, in turn, the required field size, is determined by the combinatorial structure of the underlying communication network. For example, consider the network depicted in Figure 15.8. Let Γ_{e_1}, ..., Γ_{e_4} be the global encoding vectors of edges e_1, ..., e_4. Note that in this network each pair (v_i, v_j) of intermediate nodes is connected to a terminal, hence any two of the global encoding vectors Γ_{e_1}, ..., Γ_{e_4} must be linearly independent. Note also that over GF(2) there exist only three non-zero pairwise linearly independent vectors of size two: (1 0), (0 1), and (1 1); hence F_2 = GF(2) is insufficient for achieving the network capacity. However, it is possible to find a network coding solution over GF(3) or a larger field. For example, over GF(3), the following global encoding vectors are feasible: (1 0), (0 1), (1 1), and (1 2).
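To make the counting argument concrete, the short brute-force check below (a sketch with our own function names, not part of the chapter's construction) confirms that over GF(2) at most three non-zero length-two vectors can be pairwise linearly independent, while the four vectors listed above are pairwise independent over GF(3).

    from itertools import combinations, product

    def pairwise_independent(vectors, q):
        # Two length-2 vectors are linearly dependent over GF(q) iff their 2x2 determinant is 0 mod q.
        for (a, b), (c, d) in combinations(vectors, 2):
            if (a * d - b * c) % q == 0:
                return False
        return True

    def max_pairwise_independent(q):
        # Largest set of non-zero length-2 vectors over GF(q) that are pairwise independent (brute force).
        nonzero = [v for v in product(range(q), repeat=2) if v != (0, 0)]
        for r in range(len(nonzero), 0, -1):
            if any(pairwise_independent(c, q) for c in combinations(nonzero, r)):
                return r
        return 0

    print(max_pairwise_independent(2))                                # 3: GF(2) cannot support four such edges
    print(pairwise_independent([(1, 0), (0, 1), (1, 1), (1, 2)], 3))  # True: the GF(3) solution above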
As mentioned in the previous section, a feasible network code can be found by identifying a non-zero solution of the multivariate polynomial P = ∏_{t∈T} det(M_t). As shown in Lemma 15.1, such a solution exists if the size q of the finite field F_q is larger than the maximum degree of any variable β_i of P. In this section, we show that the maximum degree of any variable in P = ∏_{t∈T} det(M_t) is bounded by k = |T|, which implies that a field of size q > k is sufficient for finding a feasible solution to the problem.
In our model we assumed that each edge of the network sends a packet only once, after it receives a packet from each of its incoming edges. In this section, for the purpose of analysis, we assume that the communication is performed in rounds.
[Figure 15.8: a multicast coding network with source s, intermediate nodes v_1, ..., v_4, and terminals t_1, ..., t_6; each pair of intermediate nodes is connected to a terminal.]
    Γ_e = Σ_{i=0}^{d} Γ_e^i,                                             (15.12)
where d is the length of the longest path in the network that starts at node s.
We define an |E| × |E| matrix T that captures the information transfer between different communication rounds. Matrix T is referred to as an adjacency matrix:

    T(i, j) = β_{e_i,e_j}   if e_i is a parent edge of e_j,
              0             otherwise.                                   (15.13)
For example, the network depicted in Figure 15.9 has the following adjacency matrix T:

        [ 0  0  β_{e_1,e_3}  β_{e_1,e_4}  0            0            0           ]
        [ 0  0  β_{e_2,e_3}  β_{e_2,e_4}  0            0            0           ]
        [ 0  0  0            0            β_{e_3,e_5}  β_{e_3,e_6}  0           ]
    T = [ 0  0  0            0            0            0            β_{e_4,e_7} ]      (15.14)
        [ 0  0  0            0            0            0            β_{e_5,e_7} ]
        [ 0  0  0            0            0            0            0           ]
        [ 0  0  0            0            0            0            0           ]
We also define an h × |E| matrix A and an |E| × 1 vector B_e for each e ∈ E as follows:

    A(i, j) = 1   if i = j,
              0   otherwise,                                             (15.15)
[Figure 15.9: a coding network with source s, terminal t, intermediate nodes a and b, and edges e_1, ..., e_7.]
Note that for e_3 it holds that Γ_{e_3}^i is non-zero only for i = 1. In contrast, edge e_7 has two non-zero vectors, Γ_{e_7}^2 and Γ_{e_7}^3.
By substituting Equation (15.17) into Equation (15.12) we obtain

    Γ_e = A · (I + T + T^2 + ··· + T^d) · B_e.                           (15.18)
det(M_t) = det(M'_t),

where

    M'_t = [ A       0      ]
           [ I − T   B_t^T  ] .
The proof of Theorem 15.3 involves basic algebraic manipulations and can be found in [17]. The structure of the matrix M'_t implies that the maximum degree of any local encoding coefficient β_{e',e} in det(M'_t) is at most one.
⁴ A matrix T is called nilpotent if there exists some positive integer n such that T^n is a zero matrix.
One of the important properties of network coding for multicast networks is that a feasible network code can be efficiently identified through a randomized algorithm. A randomized algorithm chooses each encoding coefficient at random, with uniform distribution over a sufficiently large field F_q. To see why a random algorithm works, recall that the main goal of the network coding algorithm is to find a set of encoding coefficients {β_{e',e}} that yield a non-zero value of P = ∏_{t∈T} det(M_t). Theorem 15.5 bounds the probability of obtaining a bad solution as a function of the field size.
Theorem 15.5 (Schwartz–Zippel). Let P(x_1, ..., x_n) be a non-zero polynomial over F_q of total degree at most d. Also, let r_1, ..., r_n be a set of i.i.d. random variables with uniform distribution over the finite field F_q of size q. Then,

    Pr(P(r_1, ..., r_n) = 0) ≤ d/q.
The theorem can be proven by induction on the number of variables. As discussed in the previous section, the degree of each variable β_{e',e} in ∏_{t∈T} det(M_t) is at most |T|. Let ν be the total number of encoding coefficients; the total degree of P is then at most ν·k. Thus, if we use a finite field F_q such that q > 2νk, the probability of finding a feasible solution is at least 50%. In [17] a tighter bound of (1 − |T|/q)^ν on the probability of finding a non-zero solution has been shown.
Random network coding has many advantages in practical settings. In particular,
it allows each node in the network to choose a suitable encoding coefficient in a
decentralized manner without prior coordination with other nodes. Random coding
has been used in several practical implementation schemes [10].
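As a rough illustration of how the field size affects the failure probability, the toy experiment below (a sketch under simplifying assumptions: it draws the global encoding vectors of the received packets uniformly at random, rather than propagating random local coefficients through a particular graph) estimates how often a terminal holding h randomly coded packets is unable to decode, over a small and over a larger prime field.

    import random

    def det_mod(matrix, q):
        # Determinant modulo a prime q via Gaussian elimination.
        m = [row[:] for row in matrix]
        n, det = len(m), 1
        for col in range(n):
            piv = next((r for r in range(col, n) if m[r][col] % q), None)
            if piv is None:
                return 0
            if piv != col:
                m[col], m[piv] = m[piv], m[col]
                det = -det
            det = det * m[col][col] % q
            inv = pow(m[col][col], q - 2, q)
            for r in range(col + 1, n):
                f = m[r][col] * inv % q
                m[r] = [(a - f * b) % q for a, b in zip(m[r], m[col])]
        return det % q

    def decoding_failure_rate(h, q, trials=2000):
        # Fraction of trials in which h random global encoding vectors are singular,
        # i.e., a terminal holding h such packets cannot recover the originals.
        fails = sum(
            det_mod([[random.randrange(q) for _ in range(h)] for _ in range(h)], q) == 0
            for _ in range(trials)
        )
        return fails / trials

    print(decoding_failure_rate(3, 2))    # failures are common over GF(2)
    print(decoding_failure_rate(3, 257))  # failures become rare over a larger field

The same singularity test can also be reused by a node to discard received packets whose encoding vectors are not linearly independent of those it already holds.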
Random network coding can also be used to improve the network robustness to
failures of network elements (nodes or edges) or to deal with frequently changing
topologies. Let N.G; s; T / be the original coding network and let N0 .G 0 ; s; T / be the
network topology
Q resulting from a failure of an edge or node in the network. Further,
let P D t 2T det.Mt / beQ the product of determinants of the transfer matrices in
the N.G; s; T / and P 0 D t 2T det.M0t / be a product of determinants of transfer
matrices in N0 .G 0 ; s; T /. The network code fˇe0 ;e g that can be used in both the
original network and in the network resulting from the edge failure must be a non-zero solution of the polynomial P·P'. Note that the degree of each variable in P·P' is bounded by 2|T|; hence, for a sufficiently large field size, the random code can be used for both networks, provided that after the edge failure the network still satisfies the minimum-cut condition. We conclude that with a random network code, resilience to failures can be achieved by adding redundancy to the network to guarantee that the min-cut condition is satisfied. Then, upon a failure of an edge or node, with high probability the same network code can still be used.
⁵ A topological order is a numbering of the vertices of a directed acyclic graph such that every edge e(v, u) ∈ E satisfies v < u.
we denote by C_t the set of active edges of the disjoint paths {P_i^t | 1 ≤ i ≤ h}. Also, we denote by B_t the h × h matrix whose columns are formed by the global encoding vectors of the edges in C_t.
The main invariant maintained by the algorithm is that the matrix B_t for each t ∈ T must be invertible at every step of the algorithm. At the beginning of the algorithm we assign the original packets R = (p_1, p_2, ..., p_h) to the h outgoing edges of s. When the algorithm completes, for each terminal t ∈ T the set of active edges includes the incoming edges of t. Thus, if the invariant is maintained, then each terminal will be able to decode the packets in R. We refer to this algorithm as Algorithm NetCode2 and present its formal description in Figure 15.10.
An example of the algorithm's execution is presented in Figure 15.11. Figures 15.11(a) and (b) show the original network and two sets of edge-disjoint paths that connect the source node s with terminals t_1 and t_2, respectively. Figure 15.11(c) shows the coding coefficients assigned to edges (s, v_1), (s, v_2), and (s, v_3) after node s has been processed. Note that this is one of several possible assignments of the coefficients and that it satisfies the invariant. Nodes v_1, v_2, and v_3 are processed in a straightforward way since each of those nodes has only one outgoing edge. Figure 15.11(d) shows the processing step for node v_4. This node has one outgoing edge (v_4, v_6) and needs to choose two encoding coefficients β_1 = β_{(v_1,v_4),(v_4,v_6)} and β_2 = β_{(v_2,v_4),(v_4,v_6)}. In order to satisfy the invariant, the vector β_1·[1 0 0]^T + β_2·[0 1 0]^T must not belong to either of two subspaces: the first
Fig. 15.11 An example of algorithm execution: (a) Original network and a set of edge-disjoint paths between s and t_1; (b) A set of edge-disjoint paths between s and t_2; (c) The processing step for node s; (d) The processing step for node v_4; (e) The processing step for node v_5
subspace, spanned by the vectors [1 0 0]^T and [0 0 1]^T, and the second, spanned by [0 1 0]^T and [0 0 1]^T. Note that if the finite field GF(2) is used, then the only feasible assignment is β_1 = β_2 = 1. For a larger field, there are several possible assignments of the encoding coefficients. Figure 15.11(e) demonstrates the processing step for node v_5.
The key step of the algorithm is the selection of local encoding coefficients {β_{e',e} | e' ∈ M_e} such that the requirement of Line 10 of the algorithm is satisfied. Let e be an edge in E, let T(e) ⊆ T be the set of destination nodes that depend on e, and let M_e be the set of parent edges of e. Also, consider the step of the algorithm just before edge e is processed and let {B_t} be the set of matrices for each t ∈ T(e). Since each matrix B_t is of full rank, there exists an inverse matrix A_t = B_t^{-1}. For each t ∈ T(e), let a_t be the row of A_t that satisfies a_t · Γ(P^t(e)) = 1, i.e., a_t is the row of A_t that corresponds to the column Γ(P^t(e)) of B_t (see Figure 15.12(a)). We also observe that if the column Γ(P^t(e)) of B_t is substituted by a column Γ(e), then the necessary and sufficient condition for B_t to remain full rank is that a_t · Γ(e) ≠ 0 (see Figure 15.12(b)). Thus, we need to select the local encoding coefficients {β_{e',e} | e' ∈ M_e} such that the vector Γ(e) = Σ_{e'∈M_e} β_{e',e}·Γ(e') satisfies Γ(e) · a_t ≠ 0 for each t ∈ T(e).
The encoding coefficients are selected through Procedure Coding, depicted in Figure 15.13. The procedure receives, as input, an edge e for which the encoding coefficients {β_{e',e} | e' ∈ M_e} need to be determined. We denote by g the size of the set T(e).
Fig. 15.12 Data structures: (a) A row a_t in A_t that satisfies a_t · Γ(P^t(e)) = 1; (b) The condition imposed on Γ(e)
1   g ← |T(e)|
2   Index the terminals in T(e) by t^1, t^2, ..., t^g
3   For i = 1 to g do
4       e^i ← P^{t^i}(e)
5       β_{e^i,e} ← 0
6   β_{e^1,e} ← 1
7   For i = 1 to g − 1 do
8       u^i ← Σ_{j=1}^{i} β_{e^j,e} · Γ(e^j)
9       If u^i · a_{t^{i+1}} ≠ 0 then
10          β_{e^{i+1},e} ← 0
11      else
12          For each j, 1 ≤ j ≤ i do
13              α^j ← −(Γ(e^{i+1}) · a_{t^j}) / (u^i · a_{t^j})
14          Choose α' ∈ F_q such that α' ≠ α^j for all j, 1 ≤ j ≤ i
15          For each j, 1 ≤ j ≤ i do
16              β_{e^j,e} ← β_{e^j,e} · α'
17          β_{e^{i+1},e} ← 1
18  Return {β_{e',e} | e' ∈ M_e}
Indeed, for a terminal t^j, 1 ≤ j ≤ i, the new vector u^{i+1} = α'·u^i + Γ(e^{i+1}) satisfies u^{i+1} · a_{t^j} = 0 only if α' is equal to

    α^j = −(Γ(e^{i+1}) · a_{t^j}) / (u^i · a_{t^j}).

Since there are at most i < |T(e)| ≤ q such values, the set F_q \ {α^j | 1 ≤ j ≤ i} is not empty. Thus, we choose α' ∈ F_q such that α' ≠ α^j for all j, 1 ≤ j ≤ i, and set u^{i+1} = α'·u^i + Γ(e^{i+1}) (by setting the coefficients {β_{e^j,e}} accordingly). By construction, it holds that u^{i+1} · a_{t^j} ≠ 0 for 1 ≤ j ≤ i.
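A small numeric illustration of this selection step is given below; the field size q = 7, the vectors, and the helper names are our own assumptions, and the sketch only mirrors the choice of α' described above rather than the full procedure.

    Q = 7  # a small prime field, used only for illustration

    def dot(u, v, q=Q):
        return sum(a * b for a, b in zip(u, v)) % q

    def choose_alpha(u, gamma_next, a_rows, q=Q):
        # Pick alpha' in GF(q) so that (alpha'*u + gamma_next) . a_t != 0 for every row a_t.
        # Each a_t is assumed to satisfy u . a_t != 0 (the invariant for already-processed terminals).
        forbidden = set()
        for a_t in a_rows:
            ua, ga = dot(u, a_t, q), dot(gamma_next, a_t, q)
            forbidden.add((-ga * pow(ua, q - 2, q)) % q)   # the single bad value for this terminal
        return next(alpha for alpha in range(q) if alpha not in forbidden)

    u, gamma_next = [1, 0, 0], [0, 1, 0]
    a_rows = [[1, 0, 0], [1, 1, 0]]
    alpha = choose_alpha(u, gamma_next, a_rows)
    new_u = [(alpha * x + y) % Q for x, y in zip(u, gamma_next)]
    assert all(dot(new_u, a) != 0 for a in a_rows)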
The Steiner strength η(N) of an undirected coding network N is defined as

    η(N) = min_{p∈P} |E_p| / (|p| − 1).
For example, the Steiner strength of the network depicted in Figure 15.14(a) is equal to 1.5, which is a tight bound for this particular case. It turns out that η(N) determines the maximum rate of the multicast transmission in the special case in which the set T ∪ {s} includes all nodes in the network.
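To make the definition concrete, the sketch below computes the Steiner strength of the triangle network of Figure 15.14(a). It assumes the usual convention that only partitions in which every component contains a node of T ∪ {s} are considered; since every node of this particular network is the source or a terminal, all partitions qualify, and the graph encoding and function names are ours.

    # Triangle network of Figure 15.14(a): three nodes joined by unit-capacity undirected edges.
    nodes = ["s1", "t1", "t2"]
    edges = [("s1", "t1"), ("s1", "t2"), ("t1", "t2")]

    def partitions(items):
        # Enumerate all partitions of a small list of nodes.
        if not items:
            yield []
            return
        first, rest = items[0], items[1:]
        for smaller in partitions(rest):
            for i in range(len(smaller)):
                yield smaller[:i] + [smaller[i] + [first]] + smaller[i + 1:]
            yield smaller + [[first]]

    def steiner_strength(nodes, edges):
        best = None
        for p in partitions(nodes):
            if len(p) < 2:
                continue                       # a partition must have at least two components
            comp = {v: i for i, block in enumerate(p) for v in block}
            crossing = sum(1 for u, v in edges if comp[u] != comp[v])
            ratio = crossing / (len(p) - 1)
            best = ratio if best is None else min(best, ratio)
        return best

    print(steiner_strength(nodes, edges))      # 1.5, matching the value quoted above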
The following theorem is due to Li and Li [29].

Theorem 15.6. Let N(G(V, E), s, T) be a multicast coding network over an undirected graph G. Let π(N) be the maximum rate of the multicast connection using traditional methods (Steiner tree packing) and let χ(N) be the maximum rate achievable by using the network coding approach. Then, for the case V = T ∪ {s} it holds that

    (1/2)·η(N) ≤ π(N) = χ(N) ≤ η(N).

Otherwise, it holds that

    (1/2)·η(N) ≤ π(N) ≤ χ(N) ≤ η(N).
Theorem 15.6 shows that the maximum coding advantage of network coding in
undirected networks is upper bounded by two. This is in contrast to the case of
directed networks, where the coding advantage can be significantly higher.
Fig. 15.14 An example of an undirected network: (a) Original undirected network; (b) An orientation that achieves rate 1; (c) An orientation that achieves rate 1.5
As discussed in the previous sections, network coding techniques can offer signif-
icant benefits in terms of increasing throughput, minimizing delay, and reducing
energy consumption. However, the implementation of network coding in real net-
works incurs a certain communication and computational overhead. As a result, a
thorough cost–benefit analysis needs to be performed to evaluate the applicability
of the technique for any given network setting. For example, it is highly unlikely
that the network coding technique will be implemented at core network routers due
to the high rate of data transmission at the network core. Thus, finding the network
setting that can benefit from the network coding technique is a challenging prob-
lem by itself. In this section, we discuss the practical implementation of the network
coding technique proposed by Chou et al. [9,34]. The principles of this implementa-
tion were adopted by many subsequent studies [16] and by real commercial systems
such as Microsoft Avalanche.
The content distribution system includes a single information source that gener-
ates a stream of bits that need to be delivered to all terminals. The bits are combined
into symbols. Each symbol typically includes 8 or 16 bits and represents an ele-
ment of a finite field GF .q/. The symbols, in turn, are combined into packets, such
that packet pi is comprised of N symbols 1i ; 2i ; : : : ; N
i
. The packets, in turn, are
combined into generations, each generation includes h packets. In typical settings,
the values of h can vary between 20 and 100. Figure 15.15 demonstrates the process
of creating symbols and packets from the bit stream.
The key idea of the proposed scheme is to mix packets that belong to the same generation; the resulting packet is then said to belong to that generation. Further, when a new packet is generated, the encoding is performed over individual symbols rather than over the whole packet. With this scheme, the local encoding coefficients belong to the same field as the symbols, i.e., GF(q). For example, suppose that two packets p_i and p_j are combined into a new packet p_l with local encoding coefficients β_1 ∈ GF(q) and β_2 ∈ GF(q). Then, for 1 ≤ y ≤ N, the y-th symbol of p_l is equal to σ_y^l = β_1·σ_y^i + β_2·σ_y^j.
Fig. 15.15 Packetization process: forming symbols from bits and packets from symbols
More generally, each packet p_l transmitted in the network is a linear combination of the h original packets of its generation: for 1 ≤ y ≤ N,

    σ_y^l = Σ_{i=1}^{h} γ_i^l · σ_y^i,

where (γ_1^l, ..., γ_h^l) is the global encoding vector of p_l.
Another key idea of this scheme is to attach the global encoding coefficients
to the packet. These coefficients are essential for the terminal node to be able to
decode the original packets. This method is well suited for settings with random
local encoding coefficients. The layout of the packets is shown in Figure 15.16.
Note that each packet also includes its generation number.
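The sketch below illustrates how such a packet could be assembled: random global coefficients are applied symbol-wise to the h original packets of a generation, and the header carries the generation number together with the global encoding vector, as in Figure 15.16. A small prime field stands in for the GF(2^8) or GF(2^16) arithmetic used in practice, and the field, the toy generation, and the function name are our assumptions.

    import random

    Q = 251  # a prime field used for illustration; practical systems use GF(2^8) or GF(2^16)

    def make_coded_packet(generation_id, originals, q=Q):
        # originals: the h original packets of one generation, each a list of N symbols.
        h, n = len(originals), len(originals[0])
        gamma = [random.randrange(q) for _ in range(h)]            # global encoding vector
        payload = [sum(gamma[i] * originals[i][y] for i in range(h)) % q for y in range(n)]
        return {"generation": generation_id, "gamma": gamma, "symbols": payload}

    generation = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]     # h = 3 packets, N = 4 symbols
    packet = make_coded_packet(0, generation)

The header of such a packet consists of the generation number plus h field elements, which is exactly the overhead quantified in the next paragraph.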
Attaching the global encoding vector incurs a certain communication overhead. The size of the overhead depends on the size of the underlying finite field. Indeed, the number of bits needed to store the global encoding vector is equal to h·log_2 q. In the practical case considered by Chou et al. [9], h is equal to 50 and each field element occupies 2 bytes, resulting in a total overhead of 100 bytes per packet. With a packet size of 1400 bytes, the overhead constitutes approximately 6% of the total size of the packet. If the field elements are reduced to 1 byte, then the overhead decreases to just 3% of the packet size.
Note that the destination node will be able to decode the original packets after
it receives h or more linearly independent packets that belong to the same genera-
tion. With random network coding, the probability of receiving linearly independent
packets is high, even if some of the packets are lost. The major advantage of the pro-
posed scheme is that it does not require any knowledge of the networking topology
and efficiently handles dynamic network changes, e.g., due to link failures.
The operation of an intermediate network node is shown in Figure 15.17. The
node receives, via its incoming links, packets that belong to different generations.
The packets are then stored in the buffer, and sorted according to their generation
number. At any given time, for each generation, the buffer contains a set of linearly
independent packets. This is accomplished by discarding any packet that belongs to
the span of the packets already in the buffer. A new packet transmitted by the node
is formed by a random linear combination of the packets that belong to the current
generation.
The important design decision of the encoding node is the flushing policy. The
flushing policy determines when a new generation becomes the current generation.
There are several flushing policies that can be considered. One possibility is to
change the current generation as soon as a packet that belongs to a new gener-
ation arrives via an incoming link. An alternative policy is to change generation
when all incoming links receive packets that belong to the new generation. The
performance of different flushing policies can be evaluated by a simulation or an
experimental study.
Network coding can benefit peer-to-peer networks that distribute large files (e.g.,
movies) among a large number of users [14]. The file is typically partitioned into a large number, say k, of chunks; each chunk is disseminated throughout the network in a separate packet. A target node collects k or more packets from its neighbors and
tries to reconstruct the file. To facilitate the reconstruction process, the source node
typically distributes parity check packets, generated by using an efficient erasure
correction code such as Digital Fountains [5]. With this approach, the target node
can decode the original file from any k different packets out of n > k packets sent by the source node.⁶
With the network coding technique each intermediate node forwards linear com-
binations of the received packets to its neighbors (see Figure 15.18). This approach
Fig. 15.16 Structure of the packet: the generation number, the global encoding vector, and the encoded symbols
⁶ Some efficient coding schemes require slightly more than k packets to decode the file.
[Figure 15.17: an intermediate node buffers the packets received on its incoming interfaces and sorts them by generation number.]
significantly increases the probability of the successful decoding of the file at the
target node. For example, consider the network depicted in Figure 15.19. In this ex-
ample, the file is split into two chunks, a and b. The source node then adds a parity
check packet, c, such that any two of the packets a, b, and c are sufficient for re-
constructing the original file. Figure 15.19(a) demonstrates a traditional approach
in which each intermediate node forwards packets a, b, and c to its neighbors.
Since there is no centralized control and the intermediate nodes do not have any
knowledge of the global network topology, the routing decision is done at random.
[Figure 15.18: the source sends the packets p_1, ..., p_n (including parity packets p_{k+1}, ..., p_n), and intermediate nodes forward linear combinations of the packets they receive.]
Suppose that two target nodes t1 and t2 would like to reconstruct the file. Note
that node t1 obtains two original packets, a and b. However, node t2 receives two
copies of the same packet (b), which are not sufficient for a successful decoding
operation. Figure 15.19(b) shows a network coding approach in which the interme-
diate nodes generate new packets by randomly combining the packets received over
their incoming edges. With this approach, the probability that each destination node
receives two linearly independent packets, and hence the probability of successful
decoding operation, is significantly higher.
The distinct property of a wireless medium is the ability of a sender node to broad-
cast packets to all neighboring nodes that lie within the transmission range. In
Section 15.1.1 we presented an example that shows that the network coding tech-
nique allows us to take advantage of the broadcast nature of the wireless medium
to minimize the number of transmissions. This technique has been exploited at a
network scale in several recent studies. In this section, we will discuss the work of
Katti et al. [23] that presents a new forwarding architecture, referred to as COPE,
for network coding in wireless networks. The new architecture results in a substan-
tial increase of throughput by identifying coding opportunities and sending encoded
packets over the network.
The underlying principle of the COPE architecture is opportunistic listening.
With this approach, all network nodes are set in a promiscuous mode, snooping
on all communications over the wireless medium. The overheard packets are stored
at a node for a limited period of time (around 0.5 s). Each node periodically broad-
casts reception reports to its neighbors to announce the packets which are stored at
this node. To minimize the overhead, the reception reports are sent by annotating
the data packets transmitted by the node. However, a node that has no data packets
to transmit periodically sends the reception reports in special control packets.
Fig. 15.19 (a) Traditional approach in which intermediate nodes only forward packets received
over their incoming edges; (b) Network coding approach
[Figure 15.20: a sender holding packets p_1, ..., p_4 and four neighbors: Receiver 1 needs p_4 and has p_1; Receiver 2 needs p_3 and has p_2 and p_4; Receiver 3 needs p_1 and has p_2 and p_3; Receiver 4 needs p_2 and has p_1 and p_3.]
Once a sender node knows which overheard packets are stored in its neighbors,
it can use this knowledge to minimize the number of transmissions. For example,
consider a sender node and a set of its neighbors depicted in Figure 15.20. In this
example, the sender node needs to deliver four packets p1 ; : : : ; p4 to its neighbors.
For each neighbor, we show the packets it requires, as well as the packets stored in
its cache. The traditional approach requires four transmissions to deliver all packets, while the network coding approach requires only two transmissions: p_1 + p_2 + p_3 and p_1 + p_4; all operations are performed over GF(2^n).
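Since coding over GF(2^n) amounts to a bitwise XOR of equal-length packets, the two transmissions can be checked in a few lines. The packet contents are arbitrary, and the receiver caches follow the configuration sketched for Figure 15.20 above.

    def xor(a, b):
        # Coding over GF(2^n) is a bitwise XOR of equal-length packets.
        return bytes(x ^ y for x, y in zip(a, b))

    p1, p2, p3, p4 = b"AAAA", b"BBBB", b"CCCC", b"DDDD"

    tx1 = xor(xor(p1, p2), p3)   # first coded transmission: p1 + p2 + p3
    tx2 = xor(p1, p4)            # second coded transmission: p1 + p4

    assert xor(tx2, p1) == p4                      # Receiver 1 (has p1) recovers p4
    assert xor(xor(tx1, p2), p3) == p1             # Receiver 3 (has p2, p3) recovers p1
    assert xor(xor(tx1, p1), p3) == p2             # Receiver 4 (has p1, p3) recovers p2
    assert xor(xor(tx1, xor(tx2, p4)), p2) == p3   # Receiver 2 (has p2, p4) recovers p1, then p3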
This example shows that the network coding technique has the potential to reduce
the number of transmissions. The problem of selecting an encoding scheme that
minimizes the number of transmissions is referred to as Index Coding. It has been
studied in several recent works [3, 8, 26, 32].
A practical implementation of this approach needs to overcome several chal-
lenges. First, at times of congestion, the reception reports may get lost in collisions.
When the traffic is light, the reception reports might arrive too late and the sender
might need to guess what overheard packets are available at the neighboring nodes.
Further, the destination node might decode packets out of order, which, in turn,
might result in a congestion signal issued by the end-to-end protocol such as TCP.
The experimental results of [23] show that network coding is most helpful when the network is moderately loaded. In such situations, the network coding technique can result in a fourfold increase in throughput. The gain is smaller in underloaded networks due to limited coding opportunities at lower demands.
15.9 Conclusion
In this chapter, we discussed the basics of the network coding technique and its
applications in several areas of networking. From a theoretical perspective, network coding is a fascinating new research area that requires tools from different disciplines such as algebra, graph theory, and combinatorics. It is rich in challeng-
ing problems, many of which are still open. In particular, while multicast problems
are well understood, many problems in the general settings are open. From the prac-
tical perspective, several potential applications of network coding techniques have
been discussed in the literature. The technique has been successfully employed by a
commercial product (Microsoft Avalanche). We believe there is a potential for more
applications that can benefit from this technique.
References
1. R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung. Network Information Flow. IEEE Trans-
actions on Information Theory, 46(4):1204–1216, 2000.
2. R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows. Prentice-Hall, NJ, USA, 1993.
3. Z. Bar-Yossef, Y. Birk, T. S. Jayram, and T. Kol. Index Coding with Side Information. In
Proceedings of 47th Annual IEEE Symposium on Foundations of Computer Science, pages
197–206, 2006.
4. A. Barbero and O. Ytrehus. Cycle-logical Treatment for “Cyclopathic Networks”. IEEE/ACM
Transactions on Networking, 14(SI):2795–2804, 2006.
5. J. W. Byers, M. Luby, M. Mitzenmacher, and A. Rege. A Digital Fountain Approach to Reliable
Distribution of Bulk Data. SIGCOMM Comput. Commun. Rev., 28(4):56–67, 1998.
6. S. Chachulski, M. Jennings, S. Katti, and D. Katabi. MORE: Exploiting Spatial Diversity with
Network Coding. In MIT CSAIL Technical Report, 2006.
7. M. Charikar and A. Agarwal. On the Advantage of Network Coding for Improving Network
Throughput. In Proceedings of IEEE Information Theory Workshop, San Antonio, 2004.
8. M. Chaudhry and A. Sprintson. Efficient Algorithms for Index Coding. Computer Communi-
cations Workshops, 2008. INFOCOM. IEEE Conference on, pages 1–4, April 2008.
9. P. Chou and Y. Wu. Network Coding for the Internet and Wireless Networks. Signal Processing
Magazine, IEEE, 24(5):77–85, Sept 2007.
10. P. A. Chou, Y. Wu, and K. Jain. Practical Network Coding. In Proceedings of Allerton Con-
ference on Communication, Control, and Computing, Monticello, IL, October 2003.
11. R. Dougherty, C. Freiling, and K. Zeger. Insufficiency of Linear Coding in Network Informa-
tion Flow. IEEE Transactions on Information Theory, 51(8):2745–2759, 2005.
12. E. Erez and M. Feder. Convolutional Network Codes. In IEEE International Symposium on
Information Theory, 2004.
13. C. Fragouli and E. Soljanin. Network Coding Fundamentals. Now Publishers, Inc, 2007.
14. C. Gkantsidis, J. Miller, and P. Rodriguez. Anatomy of a P2P Content Distribution System
with Network Coding. In IPTPS’06, February 2006.
15. C. Gkantsidis, J. Miller, and P. Rodriguez. Comprehensive View of a Live Network Coding
P2P System. In IMC ’06: Proceedings of the 6th ACM SIGCOMM conference on Internet
measurement, pages 177–188, 2006.
16. C. Gkantsidis and P. Rodriguez. Network Coding for Large Scale Content Distribution. IN-
FOCOM 2005. 24th Annual Joint Conference of the IEEE Computer and Communications
Societies. Proceedings IEEE, 4:2235–2245 vol. 4, March 2005.
17. T. Ho. Networking from a Network Coding Perspective. Dissertation, Massachusetts Institute
of Technology, 2004.
18. T. Ho, R. Koetter, M. Medard, D. Karger, and M. Effros. The Benefits of Coding over Routing
in a Randomized Setting. In Proceedings of the IEEE International Symposium on Information
Theory, 2003.
19. T. Ho and D. S. Lun. Network Coding: An Introduction. Cambridge University Press,
Cambridge, UK, 2008.
20. S. Jaggi, M. Langberg, S. Katti, T. Ho, D. Katabi, M. Medard, and M. Effros. Resilient Network
Coding in the Presence of Byzantine Adversaries. IEEE Transactions on Information Theory,
54(6):2596–2603, June 2008.
21. S. Jaggi, P. Sanders, P. A. Chou, M. Effros, S. Egner, K. Jain, and L. Tolhuizen. Polynomial
Time Algorithms for Multicast Network Code Construction. IEEE Transactions on Informa-
tion Theory, 51(6):1973–1982, June 2005.
22. S. Katti, D. Katabi, W. Hu, H. S. Rahul, and M. Médard. The Importance of Being Opportunis-
tic: Practical Network Coding for Wireless Environments. In 43rd Annual Allerton Conference
on Communication, Control, and Computing, Allerton, 2005.
23. S. Katti, H. Rahul, W. Hu, D. Katabi, M. Médard, and J. Crowcroft. XORs in the Air: Practical Wireless Network Coding. In ACM SIGCOMM, Pisa, Italy, 2006.
24. R. Koetter and F. Kschischang. Coding for Errors and Erasures in Random Network Coding.
Information Theory, IEEE Transactions on, 54(8):3579–3591, Aug. 2008.
25. R. Koetter and M. Medard. An Algebraic Approach to Network Coding. IEEE/ACM Transac-
tions on Networking, 11(5):782 – 795, 2003.
26. M. Langberg and A. Sprintson. On the Hardness of Approximating the Network Coding Capac-
ity. Information Theory, 2008. ISIT 2008. IEEE International Symposium on, pages 315–319,
July 2008.
27. A. Lehman and E. Lehman. Complexity Classification of Network Information Flow Problems.
In Proceedings of SODA, 2004.
28. S.-Y. R. Li, R. W. Yeung, and N. Cai. Linear Network Coding. IEEE Transactions on Infor-
mation Theory, 49(2):371 – 381, 2003.
29. Z. Li and B. Li. Network Coding in Undirected Networks. In Proceedings of 38th Annual
Conference on Information Sciences and Systems (CISS), Princeton, NJ, USA, 2004.
30. Z. Li, B. Li, D. Jiang, and L. C. Lau. On Achieving Optimal Throughput with Network Coding.
INFOCOM 2005. 24th Annual Joint Conference of the IEEE Computer and Communications
Societies. Proceedings IEEE, 3:2184–2194 vol. 3, March 2005.
31. R. Lidl and H. Niederreiter. Finite Fields. Cambridge University Press, 2nd edition, 1997.
32. E. Lubetzky and U. Stav. Non-linear Index Coding Outperforming the Linear Optimum. In
Proceedings of 48th Annual IEEE Symposium on Foundations of Computer Science, pages
161–168, 2007.
33. M. Médard, M. Effros, T. Ho, and D. Karger. On Coding for Non-multicast Networks. In 41st
Annual Allerton Conference on Communication Control and Computing, Oct. 2003.
34. Y. Wu, P. A. Chou, and K. Jain. Network Coding for the Internet. In IEEE Communication Theory Workshop, Capri, Italy, 2004.
35. D. Silva, F. Kschischang, and R. Koetter. A Rank-Metric Approach to Error Control in Random
Network Coding. Information Theory, IEEE Transactions on, 54(9):3951–3967, Sept. 2008.
36. Y. Wu, P. Chou, and S.-Y. Kung. Minimum-Energy Multicast in Mobile Ad Hoc Networks
Using Network Coding. IEEE Transactions on Communications, 53(11):1906–1918, Nov.
2005.
37. R. Yeung. Information Theory and Network Coding. Springer, 2008.
Chapter 16
Next Generation Search
Abstract Searching for information is one of the most common tasks that users
of any computer system perform, ranging from searching on a local computer, to a
shared database, to the Internet. The growth of the Internet and the World Wide Web,
the access to an immense amount of data, and the ability of millions of users to freely
publish their own content has made the search problem more central than ever be-
fore. Compared to traditional information-retrieval systems, many of the emerging
information systems of interest, including peer-to-peer networks, blogs, and social
networks among others, exhibit a number of characteristics that make the search
problem considerably more challenging. We survey algorithms for searching infor-
mation in systems that are characterized by a number of such features: the data are
linked in an underlying graph structure, they are distributed and highly dynamic,
and they contain social information, tagging capabilities, and more. We call such
algorithms next-generation search algorithms.
16.1 Introduction
The problem of searching for information in large volumes of data involves deal-
ing with two main challenges: (i) identifying the information need of the users and
designing models that provide an estimate of the relevance of each data item with re-
spect to the user queries, and (ii) designing efficient algorithms that make it possible
to locate the relevant information without having to visit all items in the available
data.
The growth of the Internet and the World Wide Web, the access to an immense
amount of data, and the ability of millions of users to freely publish their own con-
tent has made the search problem more difficult than ever before. The growth of the
available information has an impact on both of the above-mentioned search chal-
lenges. First, the overwhelming wealth of information, which often is very noisy
and of low quality, makes it extremely difficult to find relevant items, and even more
difficult to distinguish the items of the highest quality among all the relevant ones.
Second, due to the formidable size and the distributed nature of the data, there is an imperative need for efficient search algorithms, especially for those search systems that are designed to work in an online mode – such as web search engines.
Information retrieval is a well-established area that deals with many different
aspects of the information-search problem. In the typical scenario of an information-
retrieval system, users search for information in a collection of documents, which
is assumed to be “static” and “flat.” In contrast, many of the emerging informa-
tion systems of interest, such as peer-to-peer networks, blogs, social networks, and
social tagging systems among others, exhibit additional structure and have a num-
ber of properties that make the search problem very different than in the typical
information-retrieval setting of plain document collections.
A very important and almost ubiquitous characteristic of emerging information
systems is that the data can be abstracted by considering an underlying graph and as-
suming that content is residing on the nodes of this graph. The most typical example
is the web graph, which is formed by hyperlinks between web documents. Another
example is a social-network graph, the users of which are implicitly or explicitly
connected to each other. In the case of a social network the content information
consists of the user profiles, the actions of the users in the network, etc. When a
user of a social network searches for information in the network, we would like to
take into account the underlying graph structure, as well as the content (profile) of
the user and his/her neighbors. Using such additional information the search results
can be fine-tuned to the preferences and information needs of the users. As another
example, in a peer-to-peer network, the data are distributed in an overlay network
of autonomous users who may join or quit the network at will. As the available
information is prone to adversarial manipulation, identifying authoritative sources
of information is very important, as well as building mechanisms that quantify the
reputation of the data publishers. In cases such as blogs and news, the available
information is highly dynamic and becomes obsolete very fast. In other cases, in
addition to the document structure and graph structure of the data, tags are also
available and can be used to guide the search process.
In this chapter we survey algorithms for searching information in systems char-
acterized by the features we described above: presence of additional graph structure,
availability of social or other context, social tagging systems, vulnerability to adver-
sarial manipulation, high dynamicity, and more. We call such algorithms algorithms
for next-generation search.
Our survey is organized as follows. We start by reviewing traditional information-
retrieval and web-retrieval systems. We provide a brief description of the now classic
PageRank and HITS algorithms that are used for ranking information by determin-
ing authoritativeness scores. We then discuss distributed search algorithms and how
to obtain ranking scores for the case of peer-to-peer networks. We next turn to searching in social networks; we review the seminal work of Kleinberg on navigating in small-world graphs, as well as searching in social tagging systems. We
conclude our survey with additional topics, such as searching in highly dynamic
information systems, and searching with context information.
Early web-search engines retrieved documents only on the basis of text. As more information was accumulated in the web, text-based retrieval became ineffective due to what Jon Kleinberg called "the abundance problem" [46]. The "abundance problem" occurs when a search for a query returns millions of documents, all containing the appropriate text, and one has to select the documents of highest quality and the ones that will best address the information needs of a user. Search-engine algorithms had to evolve in complexity to handle this problem of overabundance. The first generation of web-search engines, which appeared in the mid-1990s, was based solely on traditional information-retrieval methods. A formal characterization of information-retrieval models is given by Baeza-Yates and Ribeiro-Neto [9].
    idf_t = log(N / df_t),

where df_t is the document frequency of t, that is, the number of documents in the collection in which t appears at least once. As a result, the idf of a rare term is high, whereas the idf of a frequent term is low. The two factors, tf_{d,t} and idf_t, are combined in the tf-idf weighting scheme in order to obtain a measure of the importance of a term in a document:

    tf-idf_{d,t} = tf_{d,t} · idf_t.
The above weighting scheme can be used to obtain the well-known tf-idf ranking function. Assuming that a user query q consists of a number of terms and the task is to rank documents in order of their relevance to the query, the tf-idf ranking score of a document d with respect to q is defined to be the sum of the tf-idf scores of the query terms:

    score(d, q) = Σ_{t∈q} tf-idf_{d,t}.
The higher the score of a document, the more relevant the document is considered
as an answer to the user query, and the final ranking to the query is produced by
ordering the documents in descending score order.
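A minimal sketch of this ranking function on a toy corpus is shown below; raw term counts are used for tf and no length normalization is applied, whereas production systems typically rely on the variants discussed next. The corpus, query, and function names are illustrative.

    import math
    from collections import Counter

    docs = {
        "d1": "network coding improves multicast throughput".split(),
        "d2": "web search ranking uses link structure".split(),
        "d3": "peer to peer networks distribute search".split(),
    }

    N = len(docs)
    df = Counter(t for words in docs.values() for t in set(words))   # document frequencies

    def tf_idf(doc_words, term):
        tf = doc_words.count(term)                                   # raw term frequency tf_{d,t}
        return tf * math.log(N / df[term]) if df[term] else 0.0

    def score(doc_id, query_terms):
        # tf-idf ranking score: the sum of tf-idf over the query terms.
        return sum(tf_idf(docs[doc_id], t) for t in query_terms)

    query = "search networks".split()
    print(sorted(docs, key=lambda d: score(d, query), reverse=True))  # ['d3', 'd2', 'd1']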
Many variants of tf-idf, involving numerous normalization and scaling schemes
for documents and queries, have been proposed in the literature [9]. In recent
years, with the increased importance of the web-search problem, researchers have put a lot of emphasis on learning ranking functions using machine-learning algorithms [20, 32, 61]. Such machine-learning algorithms fit ranking functions on
training data that contain relevance assessments provided by human editors. The
benefit of the machine-learning approach is that it can be used to obtain functions
over an extensive number of features that go well beyond tf and idf. Such features
may include link-structure information, as explained in the next sections, anchor-
text information, linguistic characterization of terms, importance of the domain
name of a document, spam scores, and possibly more. Combining such a large number of features into a ranking function with ad hoc methods is a daunting task.
A query typically represents an information need of the user. A document is con-
sidered relevant to the user query, if the user perceives it as containing information
of value with respect to their information need. To assess the effectiveness of an IR
model, the measures of precision and recall are the ones most typically used.
Precision: The fraction of documents returned from the IR model that are consid-
ered relevant.
Recall: The fraction of the relevant documents in the whole collection that are
returned from the IR model.
The goal is to build an IR system that has both precision and recall as close to
1 as possible. Notice the trade-off between the two measures: higher score for one
measure can be achieved at the expense of the other. For instance, a recall value of 1
can be achieved by returning all the documents in the collection, but obviously such
a method will have extremely low precision.
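The trade-off is easy to see on a toy example; the set-based sketch below (for unranked result sets, with illustrative document identifiers) computes both measures.

    def precision_recall(returned, relevant):
        returned, relevant = set(returned), set(relevant)
        hits = len(returned & relevant)
        precision = hits / len(returned) if returned else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    collection = {f"d{i}" for i in range(1000)}
    relevant = {"d1", "d2", "d3"}
    print(precision_recall(collection, relevant))    # (0.003, 1.0): perfect recall, terrible precision
    print(precision_recall({"d1", "d2"}, relevant))  # (1.0, 0.666...): precise but incomplete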
The measures of precision and recall, as defined above, are used for evaluating
unranked sets. For evaluating ranked sets many other measures are used, such as
mean average precision (MAP), precision at k, the Receiver Operating Character-
istic (ROC) curve, discounted cumulative gain (DCG), and normalized discounted
cumulative gain (NDCG). For definitions and detailed discussion the interested
reader is referred to the Information Retrieval textbook of Manning, Raghavan, and
Schütze [56].
Metrics and models that are based on estimating the relevance of a document to the
query using only the text of the documents are not sufficient to address the abun-
dance problem. An elegant solution to the abundance problem is to favor documents
that are considered to be authoritative. One way to estimate the authoritativeness of
each document in the collection is to use the link-structure of the web graph: in par-
ticular, every hyperlink between two pages in the web graph can be considered as an
implicit endorsement of authority from the source document toward the target doc-
ument. Two different applications of this idea have led to two seminal algorithms in
web-search literature: the PageRank algorithm by Page and Brin [19], and the HITS algorithm by Jon Kleinberg [46].
16.2.2.1 PageRank
The PageRank algorithm models the behavior of a “random surfer” on the web
graph, which is the graph that has as nodes the web documents, and has a di-
rected edge between two nodes if there is a hyperlink between the corresponding
documents. The surfer essentially browses the documents by following hyperlinks
randomly. More specifically, the surfer starts from some node arbitrarily. At each
step the surfer proceeds as follows:
– With probability c, an outgoing hyperlink is selected randomly from the current document, and the surfer moves to the document pointed to by the hyperlink.
– With probability 1 − c, the surfer jumps to a random page chosen according to some distribution, typically the uniform distribution.
The value Rank(i) of a node i (called the PageRank value of node i) is the fraction of time that the surfer spends at node i. Intuitively, Rank(i) is considered to be a measure of the importance of node i.
PageRank is expressed in matrix notation as follows. Let N be the number of nodes of the graph and let n(j) be the out-degree of node j. Denote by M the
square matrix whose entry M_ij has the value 1/n(j) if there is a link from node j to node i. Denote by U the square matrix of size N × N that has all entries equal to 1/N, which models the uniform distribution of jumping to a random node in the graph. The vector Rank stores the PageRank values that are computed for each node in the graph. A matrix M' is then derived by adding transition edges of probability (1 − c)/N between every pair of nodes to include the case of jumping to a random node of the graph:

    M' = c·M + (1 − c)·U.

Since the PageRank process corresponds to computing the stationary distribution of the random surfer, we have M'·Rank = Rank. In other words, Rank is the principal eigenvector of the matrix M', and thus it can be computed by the power-iteration method [19].
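A compact sketch of the power-iteration computation is given below. The damping value c = 0.85, the toy graph, and the uniform treatment of dangling pages are illustrative choices rather than part of the original formulation.

    def pagerank(links, c=0.85, iters=50):
        # Power iteration for M' = c*M + (1 - c)*U; links[j] lists the pages that page j points to.
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iters):
            new = {p: (1.0 - c) / n for p in pages}
            for j in pages:
                if links[j]:
                    for i in links[j]:
                        new[i] += c * rank[j] / len(links[j])
                else:                                  # dangling page: spread its score uniformly
                    for p in pages:
                        new[p] += c * rank[j] / n
            rank = new
        return rank

    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(pagerank(links))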
The notion of PageRank has inspired a large body of research on many differ-
ent aspects, both on designing improved algorithms for more efficient computation of PageRank [25, 44, 55] and on providing alternative definitions that can be used to address specific issues in search, such as personalization [29], topic-specific
search [16, 36], and spam detection [12, 35].
One disadvantage of PageRank is that it is prone to adversarial manipulation.
For instance, one of the methods that owners of spam pages use to boost the ranking
of their pages is to create a large number of auxiliary pages and hyperlinks among
them, called link-farms, which result in boosting the PageRank score of certain tar-
get spam pages [12].
16.2.2.2 HITS
The HITS algorithm, proposed by Jon Kleinberg [46], introduced the paradigm of hubs and authorities. In the HITS framework, every page can be thought of as having
a hub and an authority identity. There is a mutually reinforcing relationship between
the two: a good hub is a page that points to many good authorities, while a good
authority is a page that is pointed to by many good hubs.
In order to quantify the quality of a page as a hub and as an authority, Kleinberg associated every page with a hub and an authority score, and he proposed the following iterative algorithm. Assuming n pages with hyperlinks among them, let h and a denote n-dimensional hub and authority score vectors. Let also W be an n × n matrix whose (i, j)-th entry is 1 if page i points to page j and 0 otherwise. Initially, all scores are set to 1. At each iteration the algorithm updates sequentially the hub and authority scores. For a node i, the authority score of node i is set to be the sum of the hub scores of the nodes that point to i, while the hub score of node i is set to be the sum of the authority scores of the nodes pointed to by i. In matrix-vector terms this is equivalent to setting h = W·a and a = W^T·h. A normalization step is then applied, so that the vectors h and a become unit vectors. The vectors a and h converge to the principal eigenvectors of the matrices W^T·W and W·W^T, respectively, and correspond to the right and left singular vectors of the matrix W.
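The iteration can be sketched in a few lines; the adjacency representation and the toy graph below are assumptions made for illustration.

    def hits(adj, iters=50):
        # Hub/authority iteration: a = W^T h, then h = W a, followed by normalization.
        pages = list(adj)
        hub = {p: 1.0 for p in pages}
        for _ in range(iters):
            auth = {p: sum(hub[q] for q in pages if p in adj[q]) for p in pages}
            hub = {p: sum(auth[q] for q in adj[p]) for p in pages}
            for vec in (auth, hub):
                norm = sum(v * v for v in vec.values()) ** 0.5 or 1.0
                for p in vec:
                    vec[p] /= norm
        return hub, auth

    adj = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["c"]}
    hub, auth = hits(adj)   # page "c" receives the highest authority score in this toy graph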
Given a user query, the HITS algorithm determines a set of relevant pages for which it computes the hub and authority scores. Kleinberg suggested obtaining
such a set of pages by submitting the query to a text-based search engine. The pages
returned by the search engine are considered as a root set, which is consequently
expanded by adding other pages that either point to a page in the root set or are
pointed by a page in the root set.
Kleinberg showed that additional information can be obtained by using more
eigenvectors, in addition to the principal ones. Those additional eigenvectors cor-
respond to clusters or distinct topics associated with the user query. One important
characteristic of the HITS algorithm is that it computes page scores that depend
on the user query: one particular page might be highly authoritative with respect
to one query, but not such an important source of information with respect to an-
other query. On the negative side, having to compute eigenvectors for each query
makes the algorithm computationally demanding. In contrast, the authority scores
computed by the PageRank algorithm are non-query-sensitive, and thus, they can be
computed in a preprocessing stage.
With the extremely fast pace that characterizes the growth in the web, a key chal-
lenge in modern information retrieval is the development of distributed tools that are
able to allocate the computational load and divide the storage requirements among
different machines. The benefits of such a distributed framework are straightfor-
ward: more computational power devoted to each single task, availability of larger
storage, and the ability to scale down operational costs.
Peer-to-Peer (P2P) networks have recently been a niche paradigm for distributed
computation and file sharing. In P2P networks a large number of peers act
as both clients and servers. In addition to all advantages of distributed systems, P2P
networks ensure a number of further features such as self-organization, symmet-
ric communication, distributed control, and robustness. Search algorithms for P2P
methods have been the subject of an increasing number of research work. Topics
of interest include P2P indexing methods, distributed hash tables, distributed rank-
ing, query optimization over P2P indexes, and more. A good survey of P2P search
methods can be found in [63].
Algorithms for distributed search are also very important for designing central-
ized web-search engines. In the case of web-search engines, distributed information
retrieval tasks are accomplished using a number of modules devoted to crawling the
web, indexing the collected data, computing page-quality assessment scores, and
processing the user queries. The centralized approach presents several disadvan-
tages in terms of scalability, updating speed, capacity, and maintenance costs [8,69].
Recently Baeza-Yates et al. [8] provided an in-depth discussion regarding the main
characteristics and the key issues concerning the three search modules: crawling,
indexing, and querying. The authors of [8] emphasize the main challenges and open
problems that modern search engines are facing in order to cope with the immense
amount of data generated continuously by users and in order to guarantee high qual-
ity of results, fast response time, and high query throughput.
In this section we concentrate on the problem of assigning a quality-assessment
value to each page in a distributed environment. The challenge to this problem arises
from the fact that relevance-based measures, such as PageRank, depend on the over-
all structure of the underlying graph, while in a distributed setting each module
accesses only a subset of the data. For instance, the PageRank value of a particular
node collects the authority contributions of the nodes along all the paths ending in
this particular node [7]. Therefore, the computation of PageRank is typically ac-
complished by a central server that has access to the whole underlying graph. This
constraint for centralized access poses a potential bottleneck for the performance of
search engines and performing the PageRank computation in a distributed fashion
is a very interesting research problem.
All the solutions proposed so far for the problem of computing PageRank in a
distributed fashion follow two main directions:
Decentralized algorithms [69, 71] that start from the assumption that the un-
derlying web graph is characterized by a block structure of the corresponding
adjacency matrix. All these method are inspired by the work of Kamvar et al. [44]
that propose to exploit the topology of the web graph in order to speed up PageR-
ank computation.
Approximation algorithms that estimate the actual PageRank scores by perform-
ing the computations locally to a small subgraph [22] or at peer level [64,72] and
using message exchange to update scores through the network.
Next we present the BlockRank algorithm [44], which inspired much of the subsequent work on the topic, and we review some of the distributed ranking algorithms that have recently appeared in the literature.
16.3.1 BlockRank
Web pages are organized in web sites that are usually stored on web hosts. Empirical
analysis of portions of the web has shown that pages in the same site and the same
host are highly interconnected and loosely linked to external pages [15,44,69]. This
means that the number of intra-host and intra-domain links is much higher than the
number of inter-host and inter-domain links. As a result, ordering the URLs of the
pages (for instance, lexicographically) leads to a block structure of the adjacency
web matrix. This observation motivated Kamvar et al. to propose the BlockRank
algorithm, which is based on aggregation/disaggregation techniques [23, 66], and
domain-decomposition techniques [31]. Even though in its original formulation the
BlockRank algorithm was not intended to work in a distributed manner, its simple
idea inspired many distributed ranking methods.
The BlockRank algorithm is summarized as follows. The details of each step are
discussed subsequently.
1. Split the web pages into blocks by domain.
2. Compute the local PageRank vector lJ for the pages within each block J .
3. Estimate the relative importance scores, or BlockRank scores, for each block.
4. Weight the Local PageRank of a page in each block by the BlockRank of the
corresponding block and aggregate the weighted Local PageRank to form an
approximate Global PageRank vector z.
5. Use z as a starting vector for standard PageRank.
The local PageRank vector l_J is computed by considering only the links within the block J. It has been empirically observed that the local PageRank preserves the relative ranking with respect to the global one. This observation suggests using as a stopping criterion the Kendall's τ [45] between the rankings obtained in two consecutive iterations of the algorithm. The computation ends when the Kendall's τ residual is equal to 1.¹
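In other words, the stopping rule asks that two consecutive rankings agree on every pair of items, i.e., that the KDist measure defined in the footnote below equals 0 (equivalently, Kendall's τ equals 1). A small sketch, with illustrative page identifiers, follows.

    from itertools import combinations

    def kendall_distance(r1, r2):
        # Fraction of item pairs on which the two rankings disagree (the KDist of the footnote).
        pos1 = {x: i for i, x in enumerate(r1)}
        pos2 = {x: i for i, x in enumerate(r2)}
        items = list(r1)
        disagree = sum(
            1 for x, y in combinations(items, 2)
            if (pos1[x] - pos1[y]) * (pos2[x] - pos2[y]) < 0
        )
        n = len(items)
        return disagree / (n * (n - 1) / 2)

    previous = ["p1", "p2", "p3", "p4"]   # ranking after iteration k-1
    current = ["p1", "p3", "p2", "p4"]    # ranking after iteration k
    converged = kendall_distance(previous, current) == 0.0   # False: keep iterating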
The relative importance of each block is computed using the standard PageRank algorithm over the block graph B, a weighted directed graph whose nodes correspond to blocks and in which an edge between two nodes is present if there is a link between two pages in the corresponding blocks. The edges from a block I to a block J are weighted by the sum of the local PageRank values of the pages of I pointing to J. Formally, the block matrix B is the k × k matrix

    B = L^T A S,

where L is the n × k matrix whose columns are the local PageRank vectors l_J, A is the adjacency matrix of the global graph, and S is an n × k matrix with the same structure as L, but whose nonzero entries are replaced by 1.
The global PageRank value of a page j ∈ J is approximated by its local PageRank score, weighted by the BlockRank value of the block J that the page belongs to.
The main advantage of the BlockRank algorithm is that the majority of the host blocks can fit in main memory, which effectively speeds up the overall computation. The extension of the algorithm to distributed environments is straightforward. The main limitation is clearly due to the need of knowing the overall structure of the network for the BlockRank computation.
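Step 4 of the algorithm, the aggregation of local scores into an approximate global vector, can be sketched as follows; the site names and score values are hypothetical.

    def approximate_global_pagerank(local_pr, block_rank):
        # Weight each page's local PageRank by the BlockRank of its block; the result z
        # serves as the starting vector for the standard PageRank iteration (step 5).
        z = {}
        for block, scores in local_pr.items():
            for page, score in scores.items():
                z[page] = score * block_rank[block]
        return z

    local_pr = {
        "siteA": {"A/index": 0.7, "A/about": 0.3},
        "siteB": {"B/index": 0.6, "B/news": 0.4},
    }
    block_rank = {"siteA": 0.65, "siteB": 0.35}
    z = approximate_global_pagerank(local_pr, block_rank)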
¹ We remind the reader that the Kendall's τ distance between two rankings r_1 and r_2 on n items is defined to be the fraction of item pairs (i, j) on which the two rankings disagree:

    KDist(r_1, r_2) = ( Σ_{i,j} K_{i,j}(r_1, r_2) ) / ( n(n − 1)/2 ),

where K_{i,j}(r_1, r_2) is equal to 1 if i and j appear in different order in r_1 and r_2, and 0 otherwise.
Next we discuss some of the algorithms that have been recently proposed in the
literature for computing PageRank scores in a distributed setting. Far from being
an exhaustive survey of the topic, this section aims to illustrate some of the current
research directions.
16.3.2.1 ServerRank
16.3.2.2 SiteRank
Wu and Aberer [71] propose to study the web graph at the granularity of web sites,
instead of considering web pages. Their algorithm, which they called SiteRank,
works in a similar way to ServerRank but the motivation behind the two algorithms
is different. ServerRank is proposed as a solution to the problem of ranking in a
distributed fashion sets of pages that are physically stored on different servers. Site-
Rank, instead, studies the rank of each single web site whose pages might belong or
not to different servers. The algorithm performs the traditional PageRank computa-
tion on the site graph, i.e., the graph whose nodes are sites and edges are hyperlinks
among them.
One of the interesting results of Wu and Aberer is that the SiteRank distribution
still follows a power law with coefficient equal to 0.95. Moreover their experiments
verify the existence of a mutual reinforcement relationship between SiteRank and
PageRank:
1. Most pages of important web sites are also important.
2. If a web site has many important pages, it is highly probable that it is an important
site.
This empirical observation justifies the choice of algorithms in this family to weight
local PageRank scores with block/server/site PageRank scores in order to obtain
global authority scores for web pages. As in the case of ServerRank, the computation
of a global ranking for web documents using a decentralized architecture for search
systems [64] requires three steps:
1. The computation of the SiteRank vector rS .
2. The computation of a local score vector r_L, given by the sum of the local PageRank score r_I and a correction vector r_E. The coordinate of the vector r_E that corresponds to document d is defined as

    r_E^d = Σ_{I_i(d) ∈ I(d)} ( out_d(I_i(d)) / N_{I(d)} ) · r_S(v_d),

where I(d) is the set of sites with at least one page pointing to d, out_d(I_i(d)) is the number of pages that point to d within the site I_i(d) ∈ I(d), N_{I(d)} is equal to Σ_{I_i(d) ∈ I(d)} out_d(I_i(d)), and r_S(v_d) is the rank value of the web site that the document d belongs to.
The local score for each document is given by r_L = w_I·r_I + w_E·r_E. In [71], the authors choose (w_I, w_E) = (0.2, 0.8) in order to give more importance to external links as opposed to internal ones, due to the relatively low number of links across web sites as compared to the number of links within the same web site.
3. The application of a new operator, i.e., the Folding Operator [1], to combine both
rankings into a global authority score.
Most of the techniques proposed in the literature are based on the assumption that the
web can be partitioned into almost-disjoint partitions that correspond to pages belong-
ing to the same hosts or domains. In this case, the link matrix has a block structure
that can be exploited to speed up PageRank computation: PageRank scores are com-
puted locally within each block and the results are combined.
In a P2P scenario, the above assumption that the link matrix has a block structure
does not hold. The reason is that peers crawl portions of the web in a completely
decentralized way without any control over overlaps between local graphs. As a
consequence pages might link to or be linked by pages at other peers. Moreover,
peers have only local (incomplete) information about the global structure of the
graph and this makes it impossible to merge local scores into global ones.
The JXP algorithm [72] is a distributed algorithm that addresses the two main
limitations described above: it allows overlap among the local networks of peers and
it does not require any a priori knowledge of the content of the other peers. It runs
locally at every peer, and combines these local authority scores with the information
obtained from other peers during randomly occurring meetings. It has been proved
that the scores computed by the JXP algorithm, called JXP scores, converge to the
true global PageRank scores.
As mentioned before, peers know exclusively the pages in their local crawl, nev-
ertheless these pages might link to or be linked by external pages belonging to other
peers’ networks. In order to perform local computation taking into account external
links, a special node, called World Node, is added to each local graph. The world
node represents all the pages in the network that do not belong to the local graph, and
all the links from local pages to external pages point to the world node. In the first
iteration of the algorithm, each peer knows only the links that point to the world
node, but it does not have any information about links from external nodes and,
hence, from the world node. The main phase of the algorithm consists of two iter-
ative steps: (i) the local computation of PageRank scores over the local graph with
the addition of the world node; (ii) a meeting phase between pairs of peers in which
the local graphs are merged into a new one.
During the first step of local PageRank computation the authority score of the
world node represents the sum of all the scores of the external nodes. In the second
step, i.e., the meeting phase, each peer exchanges the local authority scores and
discovers external links that point to its internal pages. After the meeting phase all
the information related to the other peers is discarded with the only exception of
the transition probabilities from external incoming links that are stored in the link
matrix. In this way, the outgoing links of the world node are weighted to reflect the
score mass given by the original links. It is worth noting that the transitions among
external pages have to be taken into account during the PageRank computation. For
this reason a self-loop link is added at the world node.
The two phases, described before, are summarized as follows:
Initialization Phase
1. Extend the local graph by adding the world node.
2. Compute the PageRank in the extended graph.
Main Algorithm (for every peer i in the network)
1. Choose a peer j at random.
2. Merge the local graphs by forming the union of pages in the two peers and the
edges among them (world nodes are combined as well).
3. Compute PageRank in the merged graph.
4. Use the PageRank scores in the merged graph to update the JXP scores of the
pages in the two peers.
5. Update the local graphs: Reconstruct world nodes for the two peers based on
what was learned in the meeting phase.
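The local step of the algorithm can be sketched as follows: the local graph is extended with a world node and PageRank is computed over the extended graph. The data layout, the damping factor, and the handling of the world-node self-loop are illustrative choices of ours, not those of [72]:

    import numpy as np

    def jxp_local_step(local_links, external_out, world_out=None,
                       damping=0.85, iters=100):
        """One local JXP computation: PageRank over the local graph extended with
        a world node.  local_links: node -> list of local out-neighbours;
        external_out: node -> number of out-links leaving the local graph;
        world_out: transition weights from the world node to local pages, learned
        in previous meetings (empty before the first meeting)."""
        world_out = world_out or {}
        nodes = list(local_links) + ["WORLD"]
        idx = {v: i for i, v in enumerate(nodes)}
        n = len(nodes)
        P = np.zeros((n, n))
        for v, outs in local_links.items():
            deg = len(outs) + external_out.get(v, 0)
            if deg == 0:
                continue
            for u in outs:
                P[idx[v], idx[u]] += 1.0 / deg                       # local transitions
            P[idx[v], idx["WORLD"]] = external_out.get(v, 0) / deg   # links to the world node
        for v, w in world_out.items():
            P[idx["WORLD"], idx[v]] = w                              # learned external in-links
        P[idx["WORLD"], idx["WORLD"]] = 1.0 - sum(world_out.values())  # world-node self-loop
        alpha = np.full(n, 1.0 / n)                                  # initial scores
        for _ in range(iters):
            alpha = (1 - damping) / n + damping * (alpha @ P)        # power iteration
        return dict(zip(nodes, alpha))                               # JXP scores, incl. world node

A meeting between two peers would merge the two extended graphs (or, in the light-weight variant described next, fold the other peer's information into the local world node) and rerun this local computation.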
At a peer meeting, instead of merging the graphs and world nodes, one can simply
add relevant information received from the other peers into the local world node, and
perform the PageRank computation on the extended local graph. It can be proved
that the JXP scores still converge to the global PageRank scores.
The JXP authority vector is proved to converge to the global PageRank, i.e., to
the global stationary distribution vector associated with the global N × N transition matrix
C. The analysis is carried out on the light-weight merging version of the
algorithm and is based on the assumption that the number of nodes N in the graph
is fixed.
The proof builds on the theory of state aggregation in Markov chains [23, 58].
Let
\[
P = \begin{pmatrix}
p_{11} & \cdots & p_{1n} & p_{1w} \\
\vdots & \ddots & \vdots & \vdots \\
p_{n1} & \cdots & p_{nn} & p_{nw} \\
p_{w1} & \cdots & p_{wn} & p_{ww}
\end{pmatrix}
\]
be the local transition matrix associated with each extended local graph G (that is,
the last row and last column correspond to the world node), where
\[
p_{ij} = \begin{cases} \dfrac{1}{\mathrm{out}(i)} & \text{if there exists a transition from } i \text{ to } j, \\ 0 & \text{otherwise,} \end{cases}
\]
and
\[
p_{iw} = \sum_{\substack{i \to r \\ r \notin G}} \frac{1}{\mathrm{out}(i)}
\]
for every i, j with 1 ≤ i, j ≤ n.
The transition probabilities from the world node, p_{wi} and p_{ww}, change during the
computation, so they are defined according to the current meeting t:
\[
p_{wi}^{t} = \sum_{\substack{r \to i \\ r \in W^{t}}} \frac{\alpha_{r}^{t-1}}{\mathrm{out}(r)\,\alpha_{w}^{t-1}} \qquad (16.1)
\]
\[
p_{ww}^{t} = 1 - \sum_{i=1}^{n} p_{wi}^{t} \qquad (16.2)
\]
Moreover, let
\[
\alpha = \left( \alpha_1 \; \ldots \; \alpha_n \; \alpha_w \right)^{T}
\]
be the local stationary distribution, i.e., the vector of JXP scores.
The following two theorems describe important properties of the JXP scores.
Theorem 16.1 ([72]). The JXP score of the world node, at every peer in the net-
work, is monotonically non-increasing.
Theorem 16.2 ([72]). The sum of scores over all pages in a local graph, at every
peer in the network, is monotonically non-decreasing.
These properties allow us to relate the JXP scores to the global
PageRank scores. The next theorem states that the global PageRank values are an
upper bound for the JXP scores.
Theorem 16.3 ([72]). Consider the true stationary probabilities (PageRank scores)
of the pages i ∈ G and of the world node w, denoted π_i and π_w, and their JXP scores after t
meetings, α_i^t and α_w^t. The following holds throughout all JXP meetings:
0 < α_i^t ≤ π_i for i ∈ G, and π_w ≤ α_w^t < 1.
Theorem 16.3 shows that the algorithm never overestimates the correct global
PageRank scores and, as a direct consequence, using the notion of fairness [54], the
convergence of JXP toward the true PageRank scores can be proved:
Theorem 16.4 ([72]). In a fair series of JXP meetings, the JXP scores of all nodes
converge to the true global PageRank scores.
Services of social networking, which are also known as social media, are based on
people forming a network with their friends and their acquaintances and publishing
their own content. Nowadays the web is dominated by a large number of appli-
cations tailored for social interaction, including blogs, wikis, social bookmarking,
peer-to-peer networks, and photo/video sharing. The dramatic growth in popularity
of such social networking sites is changing radically the way people communicate
and exchange information. Naturally, the study of social networks is drawing a lot
of attention in the research community; since the existing web-search algorithms
are not designed for these social environments, searching and ranking information
in social networks is an emerging area of research, too.
Social search is motivated by searching for information relevant to one’s social
interactions, profile, preferences, or communities of interest. In some cases, users
are searching in order to find the right person for a professional collaboration, for
file sharing, or just for chatting. In other cases, users are searching for an expert who
can solve a practical problem or give an opinion on different questions, ranging from
product reviews to personal relationships.
With respect to the “first generation” web content, social media is characterized
by more heterogeneous data integrating user-generated content, user connections,
ratings, comments, and more. As mentioned by Amer-Yahia et al. [6] all the activ-
ities that users perform in social-content sites can be seen as “filtering of resources
in communities by various search criteria”. Thus, services of social networking may
help users in creating and expanding their social circle, and for such a task, effective
algorithms for social search can be essential.
An idea related to social search is the concept of social web search, which is
based on social tagging, a paradigm that lets users rate and tag web pages or other
web content. Social web search is motivated by the intuition that users could find
more easily what they were looking for, based on the collective judgment of all other
users. This idea has inspired sites such as MySpace and del.icio.us, which allow users
to discover, organize, and share web content.
In 1967, Milgram [59] tested the existence of short paths inside a social network.
In the famous experiment he conducted, he asked randomly chosen individuals in
the USA to pass a chain letter to one particular person living in a suburb of Boston.
The participants of the experiment had to forward the letter they received to a sin-
gle acquaintance whom they knew on a first-name basis. While only around 29%
of the initial number of letters found their target, the median length of the com-
pleted paths was only six. In fact, one can observe that not only short paths exist in
social networks among essentially arbitrary pairs of nodes, but also, perhaps even
more surprisingly, that people were able to discover and navigate through those
short paths.
Kleinberg took up the challenge to study the design of decentralized search al-
gorithms that can locate a target in a social network so effectively [47, 48].
He suggests a simple model, a variant of the small-world model of Watts and Stro-
gatz [70], in which n nodes are arranged in a regular square lattice (of side √n).
However, instead of adding long-range links uniformly at random, Kleinberg sug-
gests adding a link between two nodes u and v with probability proportional to r^{-α},
where r is the distance between u and v in the square lattice. The parameter α con-
trols the extent to which the long-range links are correlated with the geometry of the
underlying lattice. For α = 0 the long-range links are random, while as α increases
there is a stronger bias for linking to nearby nodes. The case of α = 2, matching the
dimension of the lattice, is important because it produces a number of links within
an area around a source node that is approximately proportional to the radius of the
area.
Given the coordinates of a target node in the above lattice-based network, a nat-
ural decentralized algorithm for passing a message to that node is the following:
each node forwards the message to a neighbor – long-range or local – whose lattice
distance to the target node is as small as possible. Kleinberg showed that there is a
unique value of α for which the above "greedy" algorithm achieves a polylogarith-
mic delivery time.
Theorem 16.5 ([47, 48]).
(a) For 0 ≤ α < 2 the delivery time of any decentralized algorithm is Ω(n^{(2-α)/3}).
(b) For α = 2 the delivery time of the above greedy algorithm is O(log² n).
(c) For α > 2 the delivery time of any decentralized algorithm is Ω(n^{(α-2)/(α-1)}).
The theorem shows a threshold phenomenon in the behavior of the algorithm with
respect to the value of the parameter α. For values α < 2 the long-range links are
too random, while for α > 2 they are too short to guarantee creating a small world.
More generally, for networks built on an underlying lattice in d dimensions, the
optimal performance of the greedy algorithm occurs for the value of the parameter
α = d.
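A small simulation makes the construction and the greedy rule concrete. The sketch below uses α = 2 and draws one long-range contact per node on the fly, a simplification of the static construction; lattice sizes and parameters are illustrative only, not the experimental setup of [47, 48]:

    import random

    def lattice_dist(u, v):
        return abs(u[0] - v[0]) + abs(u[1] - v[1])

    def long_range_contact(u, side, alpha=2.0):
        """Pick one long-range contact for u with Pr[v] proportional to d(u,v)^(-alpha)."""
        nodes = [(x, y) for x in range(side) for y in range(side) if (x, y) != u]
        weights = [lattice_dist(u, v) ** (-alpha) for v in nodes]
        return random.choices(nodes, weights=weights, k=1)[0]

    def greedy_route(source, target, side, alpha=2.0):
        """Forward greedily to the neighbour (grid or long-range) closest to the target."""
        u, hops = source, 0
        while u != target:
            x, y = u
            grid = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= x + dx < side and 0 <= y + dy < side]
            candidates = grid + [long_range_contact(u, side, alpha)]
            u = min(candidates, key=lambda v: lattice_dist(v, target))
            hops += 1
        return hops

    # e.g. average hop count over a few random sources on a 50 x 50 lattice:
    # sum(greedy_route((random.randrange(50), random.randrange(50)), (25, 25), 50)
    #     for _ in range(20)) / 20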
Kleinberg generalizes the decentralized search algorithm for other models, in-
cluding a hierarchical variant of a small-world graph [49]. According to this model,
the nodes of the network are organized on a complete b-ary tree. A motivation for
this abstraction can be derived by thinking about professions or interests of people,
or about the organization within a company.
In this hierarchical model, the distance between two nodes u and v is defined
to be the height h(u, v) of their lowest common ancestor. The construction of the
random long-range links is controlled by two parameters k and β: for each node u
in the tree, outgoing edges are added by selecting k other nodes v, where each such
node v is selected with probability proportional to b^{-β h(u,v)}. As in the lattice case,
the model adds long-range links in a way that favors nearby nodes. As before,
the case of β = 0 corresponds to uniformly random links, and larger values of β
favor nearby nodes more strongly.
The search algorithm for locating a target node at a given position in the graph
is as before: each node selects the node among its neighbors that is nearest to
the target node. As an analogue of Theorem 16.5, Kleinberg showed that there is a
unique value of the parameter β for which polylogarithmic delivery time can be achieved.
Theorem 16.6 ([49]).
(a) In the hierarchical model with exponent β = 1 and out-degree k = O(log² n),
there is a decentralized algorithm with polylogarithmic delivery time.
(b) For every β ≠ 1 and every polylogarithmic function k(n) there is no decentral-
ized algorithm in the hierarchical model with exponent β and out-degree k(n)
that achieves polylogarithmic delivery time.
A more extensive discussion on decentralized search algorithms, and additional re-
sults on more general graph models, can be found in the excellent paper of
Kleinberg [51].
Adamic and Adar [2] experiment with decentralized search algorithms on real
datasets. They use a graph derived from the email network of HP Labs, by consid-
ering an edge between two individuals if they have exchanged more than a certain
number of email messages. They also consider a friendship network dataset from
a community web site, where nodes represent students within a university cam-
pus. The average distance between all pairs of nodes in the email network was three
links, while for the friendship network it was around 3.5.
For both datasets the task was to find specific individuals in the network, by
following links. For the email network, Adamic and Adar consider three different
search strategies, described as follows: at each step the current node is contacting
its neighbor who is (i) best connected, (ii) closest to the target in the organizational
hierarchy, (iii) located in the closest physical proximity to the target. Among the
three strategies, following links in the organizational hierarchy is proven to be the
best, yielding an average path length of 5. Adamic and Adar speculated that the
best-connected (highest degree) strategy did not work well (average path length 43)
because the actual degree distribution of the email network was Poisson and not a
power law, and in their previous work [3] they showed that the high-degree strategy
performs poorly for such networks.
The strategy of contacting the neighbor with the closest physical proximity to the
target gave an average path length of 11.7. Adamic and Adar explained this high path
length by computing the probability of two individuals being connected in the email
network as a function of their physical distance r. They found that the probability
of being connected was proportional to r^{-1}, instead of r^{-2}, which would have been
optimal according to Theorem 16.5.
Social tagging systems, such as del.icio.us and flickr, allow users to annotate partic-
ular resources, such as web pages or images, with freely chosen sets of keywords,
also known as “tags.” Tagging can be seen as personal bookmarking, allowing the
users to stay organized and quickly locate what they have already encountered, as well
as a form of social interaction, since users can be implicitly connected based on their
tagging activity and the resources they have shared.
Marlow et al. [57] discuss tagging systems with respect to their characteristics and
their possible benefits. In a recent study, Heymann et al. [37] investigated how social
tagging systems can be used to improve web search. Based on data obtained by
crawling the del.icio.us site over many months, they concluded that socially tagged
URLs are full of fresh information, and also that tags are appropriate and objective.
On the negative side, they showed that tags on URLs are often redundant given title,
domain, and page text, and that the coverage of tags is still too small to have a big impact
on web search.
Gulli et al. [34] present a link-based algorithm, called TC-SocialRank, which
leverages the importance of users in the social community, the importance of the
bookmarks/resources they share, additional temporal information, and clicks, in
order to perform ranking in folksonomy systems.
Amer Yahia et al. [6] provide a survey of search tasks in social tagging systems
and they discuss the challenges in making such systems applicable. They consider
many features of social tagging systems that can be used to improve the relevance
of search tasks. In addition to the traditional features of text content,
timeliness, freshness, and incoming links, these features also include tags, popularity, social
distance, and relevance to people's interests.
Following on the ideas discussed in [6], Benedikt et al. [13] have recently pre-
sented an in-depth study of a top-k search problem for social tagging systems. The
problem formulation in [13] is as follows. Users have the dual role of taggers and
seekers: taggers are tagging items (resources) and seekers are searching for those
items using keyword search. A social network is defined among users, which can be
based on explicitly declared contacts or on evaluating a similarity measure among
the users. Thus, for a user u we define N(u) to be the set of neighbors of u in
the social network. A seeker u searches for information, and the score of an item i
for the user u depends on the keyword search of u and the number of users in N(u)
who have tagged item i . Therefore, the score of each item can be different for each
user based on the social connections of u and the tagging activity of the neighbors
of u. The problem is to find the top-k items to return to a user query.
Solving the above top-k search problem can be a very inefficient process. First,
if inverted indices are to be used, as in the classic paradigm of information retrieval,
one inverted index is needed for each (tag, seeker) pair (instead of one inverted index
per tag). This is because the score of each item with respect to a tag depends on the
particular seeker who searches for this tag. Such a solution, however, requires a
prohibitively large amount of space. Benedikt et al. [13] show how to obtain an
efficient solution for this top-k search problem while keeping the amount of space
required manageable. The idea is to use a rank combination algorithm, such as the
threshold algorithm (TA) [28]. The difference with the standard application of the
TA algorithm, is that instead of maintaining exact scores for each item in the inverted
list of a tag, we need to keep an upper bound that depends on all possible seekers.
The tighter the upper bound the more efficient the rank combination algorithm will
be. Benedikt et al. [13] propose clustering the users based on their behavior (either as
seekers or as taggers), in order to obtain tighter upper bounds and, thus, improve
the performance of their top-k algorithm.
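The following sketch conveys the flavor of such a TA-style traversal over per-tag inverted lists that store seeker-independent upper bounds; the data layout and the exact-score oracle are our own simplifications, not the algorithm of [13]:

    import heapq

    def topk_ta(query_tags, inverted, exact_score, seeker, k=10):
        """Threshold-Algorithm-style top-k over per-tag inverted lists.
        inverted: tag -> list of (upper_bound, item) sorted by decreasing upper bound
        exact_score: function (item, tag, seeker) -> seeker-specific score."""
        best = {}                                # item -> exact aggregated score
        pointers = {t: 0 for t in query_tags}
        while True:
            threshold = 0.0
            for t in query_tags:
                lst, p = inverted[t], pointers[t]
                if p >= len(lst):
                    continue
                ub, item = lst[p]
                pointers[t] += 1
                threshold += ub                  # bound on any unseen item's score
                if item not in best:             # random access: exact seeker-specific score
                    best[item] = sum(exact_score(item, q, seeker) for q in query_tags)
            top = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
            if len(top) == k and top[-1][1] >= threshold:
                return top                       # k-th exact score beats the bound: stop early
            if all(pointers[t] >= len(inverted[t]) for t in query_tags):
                return heapq.nlargest(k, best.items(), key=lambda kv: kv[1])

The tighter the stored upper bounds (for example, per cluster of similar seekers), the earlier the stopping condition fires.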
A large part of the web can be described as a static set of pages that contain in-
formation, which does not change so often. Such a view of the web resembles an
encyclopedia or a University library. In addition to this static component, however,
there is a large volume of activity in the web that is characterized by a very high degree
of dynamicity: new information is posted continuously, people discuss in real time
and exchange opinions, and information becomes obsolete very rapidly. This part
of the web is referred to as the live web, and two of its most important aspects are
blogs and news.
The large volume of user-generated content has significantly changed the web
paradigm in the last 5 years: users do not simply consume content produced by pro-
fessional publishers or other users but they constantly offer their own contribution in
the form of opinions, comments, and feedback, building an extremely complex net-
work of social interactions. An important role in this radical change has been played
by Blogs (an abbreviation for weblogs). Blogs are online journals regularly updated
by users in order to share ideas and comment on topics of interest for them. The
key to their success has been the immediate accessibility to a large audience. The
latest item, inserted by the blog owner, is displayed at the top of the page, capturing
the attention of readers, who are allowed to post comments and feedback. The type
of content posted is extremely rich and varies from text to multimedia, with a high
number of links toward external pages and other blogs. The term Blogosphere in-
dicates the totality of blogs and the interconnections among them. The blogosphere
has been continuously growing, as reported by Technorati [43] in its quarterly State
of the Blogosphere reports [42]. In April 2007, Technorati was tracking 70 million
blogs and observed a rate of 120,000 new blogs created worldwide each day. With
respect to posting volume, the measure is about 1.5 million postings per day, even
though the rate of growth of postings is not as fast as the rate of growth of new
blogs created.
With respect to search algorithms, there are a number of factors that make search-
ing the Blogosphere different than traditional web search:
Temporal Connotation Each post is associated with the date and the time it was
created.
Geographical Locality Each blog should be placed in the context of the geograph-
ical location of its author.
Heterogeneous Information The utility of the information in the blogosphere can
be significantly increased if it is cross-referenced with public information in other
information networks, such as the web, user profiles, and social networks. For exam-
ple, Bhagat et al. [14] present a study on how different networks collide and interact
with each other.
High Dynamics Blog posts attract comments of other bloggers, leading to discus-
sions among different members of the Blogosphere.
The previous considerations propel the need for focused technology and special-
ized search engines such as Technorati [43], Ice Rocket Blog Search [40], BlogPulse [41],
Ask Blog Search [38], and Google Blog Search [39].
Applying standard web-search engines for searching the Blogosphere can be ex-
tremely ineffective, since standard web-search engines are not designed to cope
with the characteristics of blogs. Algorithms for searching blogs should be able to
exploit the temporal and geographical dimensions in order to discover valuable in-
formation not directly associated with search keywords. As suggested by Bansal and
Koudas [10], such additional information is easily inferable from keywords whose
popularity is strongly correlated to the search term popularity along the time axis.
This objective requires integrating search engines with methods and algorithms for
(i) temporal burst detection of term popularity, (ii) efficient discovery of correlated
set of keywords, and (iii) monitoring hot keywords.
Temporal Burst Detection It has been observed that many forms of temporal data
that reflect human-generated content, including emails [50], blogs posts [10, 53],
and news streams [26], are characterized by a bursty structure.
Kleinberg [50] studied the problem of modeling the temporal behavior of
emails containing particular keywords (such as "NSF grant"). He observed that such
behavior can be characterized by bursty activity: fragments of low and high activity
are interleaved. In fact, a more accurate description can be obtained by a hierarchical
model: fragments of high activity contain other fragments of even higher activity,
and so on. Kleinberg described a generative model that produces such bursty activ-
ity and, given an input sequence, a dynamic programming algorithm that finds the
maximum-likelihood parameters of the model describing the data. A
sequence of events is considered bursty if the fraction of relevant events alternates
between periods in which it is large and long periods in which it is small. Kleinberg
defines a measure of weight associated with each such burst and solves the problem
of enumerating all the bursts by order of weight.
Another model for detecting bursty activity is described by Bansal and
Koudas [10], who apply their model to queries submitted to blog sites. Bansal
and Koudas express the popularity x of a query as the sum x = μ + N(0, σ²),
where μ is a base popularity and N(0, σ²) is a Gaussian random variable with
zero mean and variance σ².
Given a temporal window of the last w days, the popularity values x_1, x_2, ..., x_w
for each query x are monitored. These values are then used to estimate the parame-
ters μ and σ of the model via maximum likelihood estimation [5]:
\[
\mu = \frac{1}{w} \sum_{i=1}^{w} x_i, \qquad \sigma^2 = \frac{1}{w} \sum_{i=1}^{w} (x_i - \mu)^2.
\]
Since Bansal and Koudas [10] observe that less than 5% of the x_i values are
greater than μ + 2σ, this value is taken as the threshold for burst detection: the i-th
day is a burst if its popularity value is greater than μ + 2σ.
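In code, the resulting burst test is only a few lines; the window length and the choice to exclude the current day from the estimation window are illustrative details, not prescribed by [10]:

    import math

    def burst_days(popularity, w=30):
        """Flag day i as a burst if its popularity exceeds mu + 2*sigma, where mu and
        sigma are the maximum-likelihood estimates over the previous w days.
        popularity: list of daily popularity values for one query."""
        bursts = []
        for i in range(w, len(popularity)):
            window = popularity[i - w:i]
            mu = sum(window) / w
            sigma = math.sqrt(sum((x - mu) ** 2 for x in window) / w)
            if popularity[i] > mu + 2 * sigma:
                bursts.append(i)
        return bursts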
Discovery of Correlated Keywords Potential correlation between blog posting
terms and blog query terms may provide an explanation for the observed similarities
of bursty activity between the two. However, finding all the terms correlated to the
query q is not a simple task. One of the possible definitions of the correlation c(t, q)
between a term t and a query q uses the conditional probability of having
the term t in a particular document D, given the query q, across several temporal
granularities:
\[
c(t, q) = \frac{P(t \in D \mid q \in D)}{P(t \in D)}.
\]
An efficient solution to this problem was proposed recently by Bansal and
Koudas [10]. Here the correlation c(t, q) between the term t and the query q is
measured in terms of the score s(t, q) of term t with respect to the query q, which
is defined as
\[
s(t, q) = \mathrm{idf}(t) \sum_{D \in D_q} \mathbb{1}[t \in D],
\]
where D is a document in the set D_q of all the documents that contain the query q
and idf(t) is the inverse document frequency of t in all documents D. That is, the
score s(t, q) is the number of the documents that contain the term t among the set
of documents that are relevant to the query q, scaled by idf(t). The factor idf(t) is used, as usual, to
decrease the score of very frequent words such as articles, prepositions, and adverbs.
The top-k terms in D_q are returned as correlated terms with respect to q.
Computing the above measure requires a single scan over the set of documents
D_q; however, this can still be too expensive if the total number of documents in D_q
is large. To avoid this computational overhead, a solution proposed in [10] is to
approximate the score s(t, q) on a sample of the documents.
It can be observed that s(t, q) is proportional to c(t, q) in expectation since, taking as idf(t) the
ratio between the total number of documents |D| and the number |D_t| of documents that contain t,
idf(t) is approximately 1/P(t ∈ D), while the expected number of documents in D_q that contain t is
|D_q| · P(t ∈ D | q ∈ D); hence the expected score is approximately |D_q| · c(t, q).
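A sample-based approximation of s(t, q), as suggested above, can be sketched as follows; the sample size and data layout are illustrative:

    import random

    def approx_score(term, docs_with_query, idf, sample_size=1000):
        """Estimate s(term, q) from a uniform sample of the documents containing q.
        docs_with_query: list of documents (each a set of terms) that contain q;
        idf: function term -> inverse document frequency."""
        if not docs_with_query:
            return 0.0
        sample = random.sample(docs_with_query,
                               min(sample_size, len(docs_with_query)))
        hits = sum(1 for d in sample if term in d)
        # scale the sampled count back up to the full set D_q, then apply idf
        return idf(term) * hits * len(docs_with_query) / len(sample)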
Burstiness measures the deviation of the popularity x_t from its mean value μ_t
observed over a temporal window of w days before the current one. Given the term
t, the burstiness can be computed as
\[
\frac{x_t - \mu_t}{\sigma_t}.
\]
Surprise measures the deviation of the popularity from the expected value r(x_t)
computed by a regression model over a temporal window of w days. It is given by
\[
\frac{|r(x_t) - x_t|}{\sigma_t}.
\]
The global ranking can be efficiently computed by maintaining a pre-
computed list of the term frequencies.
The EigenRumor Algorithm Fujimura et al. [30] observe that the number of in-
links to individual blog entries is very small in general, and a large amount of time
is needed to acquire inlinks. Such a lack of inlinks makes it difficult for blog posts to
acquire meaningful authority values. To overcome this problem and to obtain better
authority scores for ranking blog postings, they propose the EigenRumor algorithm,
which is a variant of the HITS algorithm [46].
Fujimura et al. [30] consider a graph G = (V_a ∪ V_p, E_a ∪ E_c), where V_a is the set
of vertices representing users (or agents), and V_p is the set of vertices representing
posts (or items, in general). The set of edges includes E_a ⊆ V_a × V_p, which captures the
authorship relation, so that (a, p) ∈ E_a if a is the author of the post p. The edges
of graph G also include E_c ⊆ V_p × V_p, which captures every comment or evaluation
of a previously published item, so that (p_1, p_2) ∈ E_c if post p_1 is a comment to post
p_2. E_a and E_c are denoted in the framework of Fujimura and Tanimoto [30] as
information provisioning and information evaluation, respectively.
The model in [30] consists of two different matrices: the provisioning matrix
P, whose nonzero entries correspond to edges of E_a, and the evaluation matrix E,
whose nonzero entries correspond to edges in E_c.
The reputation score r_p of an item p and the authority score a_a and hub score h_a
of the author/agent a are computed as follows:
\[
r = \alpha P^{T} a + (1 - \alpha) E^{T} h, \qquad a = P r, \qquad h = E r.
\]
The above equations model the following mutually reinforcing relations: (i) posts
that acquire good reputation are written by authoritative authors and are commented on
by authors of good hub score, (ii) authoritative authors write posts of good rep-
utation, and (iii) authors of good hub score comment on posts that acquire good
reputation. As in the HITS algorithm, the above system of equations is solved by
starting from an arbitrary initial solution and iteratively applying the formulas until the
vectors r, a, and h converge.
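A sketch of this fixed-point iteration is given below. To keep the dimensions of the three equations consistent, both P and E are represented here as agent-by-post matrices (a comment on a post is credited to its author); the normalization step and the value of α are our own choices, not necessarily those of [30]:

    import numpy as np

    def eigenrumor(P, E, alpha=0.5, iters=100):
        """Fixed-point iteration for r = alpha*P^T a + (1-alpha)*E^T h, a = P r, h = E r.
        P (provisioning) and E (evaluation) are agent-by-post 0/1 matrices, so that
        a and h are agent vectors and r is a post vector."""
        r = np.full(P.shape[1], 1.0 / P.shape[1])
        a = h = None
        for _ in range(iters):
            a = P @ r                                     # agents' authority scores
            h = E @ r                                     # agents' hub scores
            r = alpha * (P.T @ a) + (1 - alpha) * (E.T @ h)
            r /= np.linalg.norm(r)                        # normalize, as in HITS
        return r, a, h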
Searching news articles is one of the most important online activities. According
to a recent survey of Nielsen Online, the online reach of newspapers in July 2008
increased to 40.2% of total web users, up from 37.3% the previous year.
The problems related to news searching are similar to the ones discussed in the
previous section and, in particular, derive from the fact that news articles are gener-
ated in a stream fashion and usually expire quickly. The principal news
search engines (Yahoo! News, Ask News, Google News, Topix) collect news arti-
cles coming from many different news sources, automatically assign them to one or
more categories, and rank them according to many factors, like freshness, authori-
tativeness of the source, the number of different sources that cite a news article,
and so on. Gulli [33] introduces a general framework for building a news search
engine, describing the architecture and the various components.
Del Corso et al. [26] propose a ranking framework which models many of the
characteristics of a stream of news articles. Their approach is the first attempt to
tackle the main aspects that make news search completely different from classical
web search. Their ranking algorithm has the following characteristics:
- Using two ranking score vectors, for news postings and news sources.
- Detecting clusters of postings that "tell the same story." The dimension of the
cluster centered around a single story is an indirect measure of the relevance of
the story itself.
- Capturing the mutual reinforcement between news articles and news sources:
good news articles are issued by authoritative news sources, i.e., press agencies,
and vice versa.
- Taking directly into account the time of each news article as an inherent indicator
of the freshness of the article: fresh news articles should be considered more
important than old ones.
- Ensuring low time and space complexity to allow online computation over the
continuously incoming news stream.
Consider a news article n_i appearing at time t_i from a news site S(n_i). The
ranking algorithm assigns a ranking score value R(n_i, t_i) to that news article, and a
ranking score R(S(n_i), t_i) to the site S(n_i). The decay of news importance
over time is modeled by an exponentially decreasing function: at any instant t > t_i,
the news article ranking score is given by R(n_i, t) = e^{-α(t - t_i)} R(n_i, t_i), where α is a
parameter of the algorithm that depends on the category to which the news article belongs.
The initial news article ranking score R(n_i, t_i) takes into account two main factors:
- The authoritativeness of its source S(n_i) at the instant immediately before the
emission time t_i
- The ranking scores of similar news articles previously emitted, whose importance
has already been multiplied by the exponential decay factor
The global formula for the initial rank value of a news article is given by
\[
R(n_i, t_i) = \lim_{\tau \to 0^{+}} R(S(n_i), t_i - \tau)^{\beta}
+ \sum_{n_j : t_j < t_i} e^{-\alpha (t_i - t_j)} \, \sigma_{ij} \, R(n_j, t_j)^{\beta},
\]
where 0 < β < 1 is a parameter introduced to put the source rank in relation with
its emission frequency (see [26] for further details) and σ_{ij} is a similarity measure
of the news articles n_i and n_j that depends on the clustering algorithm used.
The ranking score of a news source sk is then given by the ranking scores of the
articles generated in the past from this source, plus a term of the ranking scores of
news articles similar to those issued from sk and posted later on by other sources.
Following the same intuition that holds for the ranking scores of news articles, both
these factors are weighted by the decay factor e^{-α(t - t_i)}. The final equation for news
sources is as follows:
\[
R(s_k, t) = \sum_{S(n_i) = s_k} e^{-\alpha (t - t_i)} R(n_i, t)
+ \sum_{S(n_i) = s_k} \; \sum_{\substack{t_j > t_i \\ S(n_j) \neq s_k}} e^{-\alpha (t - t_i)} \, \sigma_{ij} \, R(n_j, t_j)^{\beta}.
\]
The complexity of the algorithm is linear in the number of articles whose ranking
score by time t is still high enough to be considered. This allows a continuous on-
line process of computing ranking scores. Moreover, the set of experiments described
in [26] shows the robustness of this approach with respect to the range of variability
of the parameters α and β.
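A minimal sketch of the per-article bookkeeping implied by these formulas is given below; how the source rank and the similarity values σ_ij are obtained is outside the sketch, and the parameter handling is illustrative rather than that of [26]:

    import math

    def decayed(score, t_emit, t_now, alpha):
        """Time-decayed rank of an article emitted at t_emit, observed at t_now."""
        return math.exp(-alpha * (t_now - t_emit)) * score

    def initial_article_rank(source_rank, earlier, sim, t_i, alpha, beta):
        """Initial rank of an article emitted at t_i by a source whose rank just
        before t_i is source_rank.  'earlier' is a list of (rank, t_j) for previously
        emitted articles, 'sim' the corresponding similarity values sigma_ij."""
        r = source_rank ** beta                      # source rank raised to beta
        for (r_j, t_j), s_ij in zip(earlier, sim):
            r += math.exp(-alpha * (t_i - t_j)) * s_ij * (r_j ** beta)
        return r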
Consider the scenario in which a user reads or browses a particular web page and she
wants to find more information about a particular term or a phrase in the page. For
instance, a user wants to find more information about “chili peppers” while reading
a page, and the answer might depend on whether this page is about music or about
food. In other words, the information about the current page provides the context to
the user query.
Von Brzeski et al. [68] and Kraft et al. [52] introduce the problem of searching
the web with contextual information. In [52], they propose three algorithms that
leverage existing technology of search engines. The first algorithm identifies infor-
mative terms from the context page and adds those terms in the query, and then it
submits the rewritten query to a search engine. The second algorithm is a “softer”
version of the first algorithm, in which the terms of the context are not just added to
the query but they are used to bias the query result. The third algorithm is based on
the idea of rank aggregation: the context page is used to rewrite many queries and
the results of those queries are combined by using rank aggregation.
Ukkonen et al. [67] describe a framework for context-sensitive search, in which
the underlying link structure of pages is taken into account. In their framework a
graph G = (V, E) is used to represent entities containing textual information and
links among the entities. For example, nodes can represent documents and edges
can represent hyperlinks among the documents; however, other interpretations are
possible depending on the application. A query is defined by a pair ⟨q, p⟩, where q
is a set of terms and p ∈ V is a node of G that defines the context of the query.
suggest a number of text-based features from the query node p that can be used for
learning a ranking function, as well as features that are based on the structure of the
graph G.
16.6 Conclusions
Search algorithms are continuously adapting to meet the challenges that have
been posed by the evolution of the web. Scientists are called to address the research ques-
tions that have recently been created by the transition from web 1.0 to web 2.0, with
users more and more active in delivering content. Search engines are forced to cope
with the extremely rich and heterogeneous datasets that are being created by blogs,
tagging systems, social communities, media-sharing sites, and news. Such hetero-
geneity has caused the proliferation of special-purpose search engines, such as blog
and news search engines, that require ad hoc algorithms for collecting, indexing, and
ranking different items.
There are still many open problems to be addressed in next-generation search
algorithms, and furthermore new paradigms and alternate forms of communication
(e.g., mobile networks) are emerging, which will create more research directions.
At this point it is very difficult to predict which issues will be posed by the new
paradigms of web 3.0 (with type-sensitive links pointing to focused search engines,
depending on the topical category to which the anchor text belongs) or web
4.0 (with personalized links, dynamically constructed to redirect to the focused search
engine that the user specifies for the particular link type).
References
1. K. Aberer and J. Wu. A framework for decentralized ranking in web information retrieval.
In X. Zhou, Y. Zhang, and M.E. Orlowska, editors, APWeb, volume 2642 of Lecture Notes in
Computer Science, pages 213–226. Springer, 2003.
2. L. Adamic and E. Adar. How to search a social network. Social Networks, 27(3):187–203, July
2005.
3. L. Adamic, R. Lukose, A. Puniyani, and B. Huberman. Search in power-law networks. Physical
Review E, 64, 2001.
4. R. Albert and A.-L. Barabasi. Statistical mechanics of complex networks. Reviews of Modern
Physics, 74(47), 2002.
5. J. Aldrich. R.A. Fisher and the making of maximum likelihood 1912-1922. Statist. Sci.,
(3):162–176, 1997.
6. S. Amer Yahia, M. Benedikt, and P. Bohannon. Challenges in searching online communities.
Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, pages 1–9,
2007.
7. R. Baeza-Yates, P. Boldi, and C. Castillo. Generalizing pagerank: damping functions for link-
based ranking algorithms. In Procs. of the ACM Conference on Research and Development in
Information Retrieval (SIGIR), 2006.
8. R. Baeza-Yates, C. Castillo, F. Junqueira, V. Plachouras, and F. Silvestri. Challenges on
distributed web retrieval. In Procs. of the IEEE 23rd International Conference on Data Engi-
neering (ICDE), 2007.
9. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, May
1999.
10. N. Bansal and N. Koudas. Searching the blogosphere. In Procs. of the International Workshop
on the Web and Databases (WebDB), 2007.
11. A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286, 1999.
12. L. Becchetti, C. Castillo, D. Donato, R. Baeza-Yates, and S. Leonardi. Link analysis for web
spam detection. ACM Transactions on the Web (TWEB), 2(1):1–42, February 2008.
13. M. Benedikt, S. Amer Yahia, L. Lakshmanan, and J. Stoyanovich. Efficient network-aware
search in collaborative tagging sites. In Procs. of the 34th International Conference on Very
Large Databases (VLDB), 2008.
14. S. Bhagat, I. Rozenbaum, G. Cormode, S. Muthukrishnan, and H. Xue. No blog is an island —
analyzing connections across information networks. In International Conference on Weblogs
and Social Media (ICWSM), 2007.
15. K. Bharat, B.W. Chang, M. R. Henzinger, and M. Ruhl. Who links to whom: Mining linkage
between web sites. In Procs. of the IEEE International Conference on Data Mining (ICDM),
2001.
16. P. Boldi, R. Posenato, M. Santini, and S. Vigna. Traps and pitfalls of topic-biased pagerank. In
Fourth International Workshop on Algorithms and Models for the Web-Graph (WAW), 2008.
17. B. Bollobás. Mathematical results on scale-free random graphs. Handbook of Graphs and
Networks, 2002.
18. B. Bollobás and W. F. de la Vega. The diameter of random regular graphs. Combinatorica,
2(2), 1982.
19. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer
Networks and ISDN Systems, 30(1–7):107–117, 1998.
20. Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon. Adapting ranking SVM to document
retrieval. In Procs. of the ACM Conference on Research and Development in Information
Retrieval (SIGIR), 2006.
21. D. Chakrabarti and C. Faloutsos. Graph mining: Laws, generators, and algorithms. ACM
Computer Surveys, 38(1), 2006.
22. Y.-Y. Chen, Q. Gan, and T. Suel. Local methods for estimating pagerank values. In Procs. of the
13nd ACM Conference on Information and Knowledge Management (CIKM), pages 381–389,
New York, NY, USA, 2004.
23. P. J. Courtois. Queueing and Computer System Applications. Academic Press, 1997.
24. D. De Solla Price. A general theory of bibliometric and other cumulative advantage processes.
Journal of the American Society for Information Science and Technology, 27, 1976.
25. G. M. Del Corso, A. Gulli, and F. Romani. Fast pagerank computation via a sparse linear
system. Internet Mathematics, 2(3), 2005.
26. G.M. Del Corso, A. Gulli, and F. Romani. Ranking a stream of news. In Procs. of the 14th
International Conference on World Wide Web (WWW), pages 97–106, 2005.
27. P. Erdős and A. Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci,
5, 1960.
28. R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Procs.
of the 12th ACM Symposium on Principles of database systems (PODS), 2001.
29. D. Fogaras, B. Rácz, K. Csalogány, and T. Sarlós. Towards scaling fully personalized pageR-
ank: algorithms, lower bounds, and experiments. Internet Math., 2(3):333–358, 2005.
30. K. Fujimura and N. Tanimoto. The eigenrumor algorithm for calculating contributions in cy-
berspace communities. Trusting Agents for Trusting Electronic Societies, pages 59–74, 2005.
31. Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University
Press, October 1996.
32. F. Grey. Inferring probability of relevance using the method of logistic regression. In Procs. of
the ACM Conference on Research and Development in Information Retrieval (SIGIR), 1994.
33. A. Gulli. The anatomy of a news search engine. In WWW, 2005.
34. A. Gulli, S. Cataudella, and L. Foschini. Tc-socialrank: Ranking the social web. In Proceedings
of the 6th International Workshop on Algorithms and Models for the Web-Graph (WAW), 2009.
35. Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with TrustRank.
In Procs. of the 30th International Conference on Very Large Data Bases (VLDB), pages
576–587, Toronto, Canada, August 2004. Morgan Kaufmann.
36. T.H. Haveliwala. Topic-sensitive pagerank. In Procs. of the 11th International World Wide
Web Conference (WWW), Honolulu, Hawaii, May 2002.
37. P. Heymann, G. Koutrika, and H. Garcia-Molina. Can social bookmarking improve web
search? In Procs. of the International Conference on Web Search and Web Data Mining
(WSDM), 2008.
38. Ask blog search. http://blog.ask.com/.
39. Google blog search. http://blogsearch.google.com/.
40. Ice rocket blog search. http://blogs.icerocket.com.
41. Blogpulse. http://www.blogpulse.com/.
42. The state of the live web, april 2007. http://www.sifry.com/alerts/archives/000493.html.
43. Technorati. whats percolating in blogs now. http://www.technorati.com.
44. S. Kamvar, T. Haveliwala, C. Manning, and G. Golub. Exploiting the block structure of the
web for computing pagerank. Technical report, Stanford University, 2003.
45. M. Kendall and J.D. Gibbons. Rank Correlation Methods. Edward Arnold, 1990.
46. J.M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM,
46(5):604–632, 1999.
47. J.M. Kleinberg. Navigation in a small world. Nature, 6798, 2000.
48. J.M. Kleinberg. The Small-World Phenomenon: An Algorithmic Perspective. In Procs. of the
32nd ACM Symposium on Theory of Computing (STOC), 2000.
49. J.M. Kleinberg. Small-world phenomena and the dynamics of information. In Advances in
Neural Information Processing Systems (NIPS), 2001.
50. J.M. Kleinberg. Bursty and hierarchical structure in streams. In Procs. of the 8th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (KDD), pages 91–101,
New York, NY, USA, 2002. ACM Press.
51. J.M. Kleinberg. Complex networks and decentralized search algorithms. In International
Congress of Mathematicians (ICM), 2006.
52. R. Kraft, C.C. Chang, F. Maghoul, and R. Kumar. Searching with context. In Procs. of the
15th International Conference on World Wide Web (WWW), 2006.
53. R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. On the bursty evolution of blogspace. In
Procs. of the 12th International Conference on World Wide Web (WWW), pages 568–576. ACM
Press, 2003.
54. L. Lamport. Specifying Systems: The TLA+ Language and Tools for Hardware and Software
Engineers. Addison-Wesley Professional, July 2002.
55. A.N. Langville and C.D. Meyer. Updating pagerank with iterative aggregation. In Procs. of the
13th International World Wide Web Conference on Alternate track papers & posters (WWW),
pages 392–393, New York, NY, USA, 2004. ACM Press.
56. C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge
University Press, 2008.
57. C. Marlow, M. Naaman, D. Boyd, and M. Davis. Ht06, tagging paper, taxonomy, flickr,
academic article, to read. In Procs. of the 17th Conference on Hypertext and hypermedia (HY-
PERTEXT), 2006.
58. C.D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, 2000.
59. S. Milgram. The small world problem. Psychology Today, 2:60–67, 1967.
60. M. Mitzenmacher. A brief history of generative models for power law and lognormal distribu-
tions. Internet Mathematics, 1(2), 2003.
61. R. Nallapati. Discriminative models for information retrieval. In Procs. of the ACM Conference
on Research and Development in Information Retrieval (SIGIR), 2004.
62. M. Newman. The structure and function of complex networks. SIAM Review, 45(2), 2003.
63. J. Risson and T. Moors. Survey of research towards robust peer-to-peer networks: Search
methods. Technical report, Univ of New South Wales, Sydney Australia, 2006.
64. K. Sankaralingam, S. Sethumadhavan, and J.C. Browne. Distributed pagerank for p2p systems.
pages 58+. IEEE Computer Society, 2003.
65. H. Simon. On a class of skew distribution functions. Biometrika, 42(3/4), 1955.
66. H.A. Simon and A. Ando. Aggregation of variables in dynamic systems. Econometrica,
29:111–138, 1961.
67. A. Ukkonen, C. Castillo, D. Donato, and A. Gionis. Searching the wikipedia with contextual
information. In Procs. of the 17th ACM Conference on Information and knowledge manage-
ment (CIKM), 2008.
68. V. Von Brzeski, U. Irmak, and R. Kraft. Leveraging context in user-centric entity detection
systems. In Procs. of the 16th ACM Conference on Information and knowledge management
(CIKM), 2007.
69. Y. Wang and D. J. Dewitt. Computing pagerank in a distributed internet search system. In
Procs. of the 30th International Conference on Very Large Databases (VLDB), 2004.
70. D. Watts and S.H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 6684,
1998.
71. J. Wu and K. Aberer. Using siterank for P2P web retrieval. Technical Report IC/2004/31,
Swiss Federal Institute of Technology, Lausanne, Switzerland, 2004.
72. J. Xavier-Parreira, C. Castillo, D. Donato, S. Michel, and G. Weikum. The JXP method
for robust pagerank approximation in a peer-to-peer web search network. VLDB Journal,
17(2):291–313, 2008.
Chapter 17
At the Intersection of Networks and Highly
Interactive Online Games
Grenville Armitage
Abstract The game industry continues to evolve its techniques for extracting
the most realistic 'immersion' experience for players given the vagaries of best-
effort Internet service. A key challenge for service providers is understanding the
characteristics of traffic imposed on networks by games, and their service quality re-
quirements. Interactive online games are particularly susceptible to the side effects
of other non-interactive (or delay- and loss-tolerant) traffic sharing next-generation
access links. This creates challenges out toward the edges, where high-speed home
LANs squeeze through broadband consumer access links to reach the Internet. In
this chapter we identify a range of research work exploring many issues associated
with the intersection of highly interactive games and the Internet, and hopefully
stimulate some further thinking along these lines.
17.1 Introduction
Over the past decade online multiplayer computer games have emerged as a key
driver of consumer demand for higher-quality end-to-end Internet services. Early
adoption of ‘broadband’ last-mile technologies (such as cable modems or ADSL)
was driven as much by game players demanding better quality of service (QoS) as it
was by computer enthusiasts and telecommuters desiring quicker access to ‘the web’
and corporate networks. By the early 2000s, online game players were sometimes
even alerting network service providers of outages before the providers’ own mon-
itoring systems. Multiplayer games range from handfuls of players on individual
servers run by enthusiasts to tens of thousands of players linked by geographi-
cally distributed and commercially operated server clusters. The games themselves
vary widely – from highly interactive ‘first-person shooter’ and car-racing games
to more sedate role-playing and strategy games, or even modern re-interpretations
of traditional card games and board games [10]. A common attribute of online
G. Armitage ()
Centre for Advanced Internet Architectures, Swinburne University of Technology, PO Box 218,
John Street, Hawthorn, Victoria 3122, Australia
e-mail: [email protected]
G. Cormode and M. Thottan (eds.), Algorithms for Next Generation Networks, 403
Computer Communications and Networks, DOI 10.1007/978-1-84882-765-3_17,
© Springer-Verlag London Limited 2010
games is creation of a shared, virtual environment (or virtual world) within which
players ‘immerse’ themselves. Realistic immersion depends on timely exchange of
up-to-date information between participants in the virtual world. Consequently, un-
derstanding and supporting the needs of multiplayer online games is one of the
many interesting challenges facing next-generation networks. Online games are
particularly susceptible to the side effects of other non-interactive (or delay- and
loss-tolerant) traffic sharing next-generation access links.
A single taxonomy of game types or genres has yet to emerge from the game in-
dustry or academia. For the purpose of this chapter we will classify games by their
different requirements for interactivity, using names that are relatively common in
today’s online forums.
First-person shooter (FPS) games are arguably the most network-intensive, mul-
tiplayer online game genre.1 FPS games typically involve players moving around a
virtual world where the goal of the game is to shoot as many ‘enemy’ players (or
monsters, as the game story line dictates) as possible as quickly as possible. As the
name suggests, a player’s viewpoint is rendered in the first person and winning usu-
ally depends on your speed at accumulating kills (often known as ‘frags’). Because
success relies so much on reaction times FPS games are sometimes referred to as
‘twitch games,’ and players are often intolerant of network disturbances that last
beyond tens of milliseconds [4, 9, 13, 27, 32, 53].
Networked, multiplayer forms of role-playing games (RPGs) and real-time strat-
egy (RTS) games also emerged during the 1990s.2 The game environment is often
presented in a third-person or top-down view, where the human player controls
one or more characters within a virtual world. The concept of ‘winning’ in such
games is usually tied to solving puzzles, mapping out strategies, exploring the vir-
tual world, and negotiating (or collaborating) with other players. Unlike FPS games,
frenetic real-time interactions are relatively rare. Consequently, players will often
tolerate network-induced disruptions lasting for multiple seconds [26, 45]. How-
ever, during brief periods of highly interactive play (if the game supports or requires
real-time fighting scenes) the players may exhibit FPS-like intolerance of network
disturbances.
1
In 1996, idSoftware’s Quakeworld (a variant of their single-player FPS game Quake) was one of
the earliest FPS games developed specifically to support highly interactive multiplayer operation
over a wide-area network. Prior to Quakeworld, idSoftware’s Doom had been modified for LAN-
based games, but was suboptimal for wide-area (Internet-based) play. Modern examples include
Counterstrike:Source, Unreal Tournament 2007, and Enemy Territory:Quake Wars.
2
Massively multiplayer online games (MMORPGs) were first popularized in their modern form
by Ultima Online in 1997. Modern examples include World of Warcraft, Everquest II, and Eve
Online. Modern examples of multiplayer RTS games include Warcraft 3, Company of Heros, and
World in Conflict. Such games are usually built to support tens of thousands of concurrent players.
The front line of all online games comprises human players interacting with game
clients, software coupled to an underlying hardware platform (loosely in the case of
general purpose PCs, or tightly in the case of consoles or hand-held game devices).
Game clients are responsible for establishing the immersive in-game experience for
their players, and communicating with other clients to keep the shared virtual envi-
ronment up to date. Typically, this involves sending and receiving regular updates,
informing other clients what your player is doing, and learning from other clients
what their player is (or players are) doing.
Conceptually, a game could operate in a pure peer-to-peer mode, with every
client exchanging information with every other client to maintain a common, shared
view of the virtual environment. However, the traffic mesh resulting from pure peer-
to-peer communication is problematic at the network layer. Consider a game with
N clients. All N clients must have sufficient in-bound network capacity to handle
the traffic from (N - 1) other clients, and out-bound capacity to support the traffic
sent to (N - 1) other clients. Every client's messages are likely to be short (a few
tens of bytes indicating that, for example, a player is moving, or just triggered a
weapon) and of very similar size. Short messages mean each IP packet contains
very little information. Furthermore, each time a client updates their peers they will
emit (N - 1) identical messages, and N will be constrained by the outbound capac-
ity of (typically asymmetric) consumer access links.3 Finally, shared knowledge of
everyone’s IP address allows malicious players to launch direct IP-layer denial of
service (DoS) attacks on other players (overloading the target’s network link and
disrupting the target’s game play).
A client server alternative involves clients communicating indirectly through a
central server. The simplest approach is to emulate network layer multicast – clients
unicast each game update message to the server, which unicasts copies of the update
messages back out to every other client. Each client now sends only one message
per update (rather than N - 1 messages), and clients no longer know each others' IP
3
It would be advantageous to use multicast rather than unicast, yet network layer multicast is
largely theoretical in consumer environments.
addresses (reducing the risk of IP-layer DoS attacks).4 However, most online games
take this one step further and implement game-logic in the server to make it a central
arbiter of all events and activities occurring inside the game world. Update messages
arriving from individual clients are interpreted and vetted by the game server to en-
sure each player’s actions are consistent with the game world’s rules. At regular
time intervals the game server unicasts aggregate update messages (snapshots) to
all clients, informing them of events and actions occurring since the last snapshot.5
A single snapshot replaces (N - 1) update messages from individual clients, reduc-
ing the number of packets per second required to support N clients.6
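A back-of-envelope comparison of the per-client upstream load illustrates why the difference matters; the player count and update rate below are purely illustrative:

    def upstream_msgs_per_sec(n_players, update_rate_hz, client_server=True):
        """Messages each client must send per second: one per update to the server,
        versus one copy per peer in a pure peer-to-peer mesh."""
        per_update = 1 if client_server else (n_players - 1)
        return update_rate_hz * per_update

    # e.g. 16 players updating 20 times a second:
    #   peer-to-peer:  20 * 15 = 300 upstream messages/s per client
    #   client-server: 20 * 1  =  20 upstream messages/s per client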
Client server is usually preferred over peer-to-peer because it reduces the client-
side bandwidth requirements and centralizes cheat-mitigation and game rule en-
forcement in a (relatively) trusted third party [10]. However, servers can come under
significant pressure for networking and computational resources, particularly for
games that create a single environment for thousands or tens of thousands of play-
ers. Large-scale games will often use a hybrid scheme, where the ‘game server’ is
really a cluster of physically separate servers using a peer-to-peer mesh amongst
themselves to create an illusion of coordinated game state processing.
Both UDP (user datagram protocol) [42] and TCP (transmission control proto-
col) [28, 43] have been used for sending update and snapshot messages over IP.
The choice tends to depend on a game’s particular trade-off between timeliness and
reliability of sending messages. TCP will transfer messages reliably yet introduce
multisecond delays in the face of short-term network degradation. UDP will ei-
ther get messages through or lose them completely, without causing any additional
delays to subsequent messages.
Unlike relatively delay- and loss-tolerant applications (such as email, instant mes-
saging, peer-to-peer file transfer, and streaming non-interactive multimedia content),
online games make consumers intimately aware of degradations to network service.
Unfortunately, today’s consumer broadband access links tend to lump all traffic to-
gether at the home–Internet boundary. A game player’s experience may become
collateral damage as other applications push and shove their way across the con-
sumer’s home link.
4 From the perspective of each client’s game logic this model is still peer-to-peer, as the server
simply repeats and replicates messages at the network level.
5 Snapshots typically occur tens of times per second to ensure smooth approximation to real-time
updates of player movements and actions.
6 A common optimization is for the server to send client X a snapshot that excludes information on
events and avatars not visible to the player controlling client X (e.g., because the player’s view is
obscured by a wall). This limits the potential for a hacked client to reveal information about other
players who believe themselves to be hidden behind some in-game obstruction.
This chapter’s main focus will be FPS games – due to their significantly interactive
nature they provide a good illustration of games that are intolerant of fluctuations
in, and degradations of, network service. We will consider how FPS games operate,
summarize their latency sensitivity, and review the traffic they impose on networks
over short and long time frames. We will then review recent work on simulating
FPS game traffic (for ISP network engineering), techniques for connecting clients
together with servers who are ‘close enough,’ and conclude with emerging tech-
niques for detecting game traffic running across an ISP network.
7 Typically a small router or switch integrated with a broadband ADSL or Cable modem.
8 Variants of traditional ‘NewReno’ TCP have also emerged in the past decade – such as HTCP
(available for Linux and FreeBSD), CUBIC (now the default in Linux), and Compound TCP (an
option for Windows Vista). Experimental trials with CUBIC and HTCP revealed that TCP sessions
using either algorithm cause faster cycling of the latency through a congested router than NewReno
sessions. They also generally induced higher latencies than NewReno [12, 46].
Internet-based FPS games generally operate in a client–server mode, with game
servers being hosted by Internet service providers (ISPs), dedicated game hosting
companies, and individual enthusiasts. Although individual FPS game servers typ-
ically host only from 4 to around 30+ players, there are usually many thousands of
individually operated game servers active on the Internet at any time.
Different publishing models also exist. PC-based FPS games commonly (and
traditionally) utilize a rendezvous service – game servers are explicitly established
by end-users, and announced on a master server (funded by the game publisher)
that clients then query. More recently, some console-based games (such as Ghost
Recon and Halo on XBox Live [35]) utilize a central server for ‘matchmaking’ –
dynamically allocating one of a group of peer consoles to be the game server for
individual game-play sessions.
FPS games utilize the network in three distinct ways: server discovery, game
play, and content downloads. Each of these imposes different requirements on the
network, and creates a different aspect of the player’s overall experience.
In this chapter we will focus on the traditional model of hosting and discovering
game servers that originated with PC-based games. Players trigger server discovery
to populate or refresh their game client’s on-screen ‘server browser’ (a list of avail-
able game servers). Clients first query a well-known master server, which returns
a list of all registered game servers (usually broken over multiple reply packets, as
the lists can be quite long). Clients then probe each game server in turn for informa-
tion (such as the current map type, game type, and players already on the server).
The probe is typically a brief UDP packet exchange, which allows the client to also
estimate the round trip time (RTT)9 between itself and each game server. Players
are presented with this information as it is gathered, and then select a game server
to join.
9 Network latency, also colloquially known as ‘lag’ or ‘ping’ in online game communities (the
latter due to ping being the name of a common tool for measuring RTT).
A given client will send out hundreds or thousands of probe packets to find and
join only one game server. Consequently, individual game servers end up receiving,
and responding to, tens of thousands of probe packets unrelated to the number of
people actually playing (or likely to play) at any given time. The ‘background noise’
due to probe traffic fluctuates over time as game clients around the Internet start up
and shut down [52].
Game play begins once a suitable server has been selected. This represents the
primary period throughout which the FPS game players are most sensitive to fluc-
tuations or degradation in network service. Client(s) send regular (although not
necessarily consistently spaced) update messages to the server and the server will
send regular snapshot messages back. The traffic flowing between client and server
is usually asymmetric – client-to-server packets are often under 100 bytes long,
whereas server-to-client packets will often range from 80 to 300+ bytes long. Up-
date and snapshot messages are sent tens of times per second to ensure that events
within the virtual world are propagated among all players in a timely and believable
fashion. Messages are sent in UDP packets, with the FPS game engines providing
their own mechanisms above UDP for packet loss detection and mitigation.
Online games often evolve new characters, virtual worlds, items inside the vir-
tual worlds, and software patches long after the original game software is released
(or sold) to the public. Most FPS games now have mechanisms for downloading new
or updated in-game content, and some are even capable of automatically updating
their executable code. Depending on the game’s design, content download may oc-
cur before, during, or after game play. It may be triggered by connecting to a server
that’s running a virtual environment (‘map’) not previously seen by the client, or be
a scheduled update initiated by the game client.
Some content (such as new ‘skins’ for player avatars) may be downloaded while
the game is in progress, in parallel with the network traffic associated with game
play. Other content updates (such as new maps) must be completed before the client
may begin a new game. Depending on how the content update is triggered, new
content may come directly from the game server, or the game server may redirect
the client to pull updated content from an entirely separate server.
Regardless of origin, content updates are infrequent events (days, weeks, or
months apart) and usually designed not to interfere with game play. Consequently
we will not discuss them in further detail.
Due to the fast-paced and highly interactive nature of FPS games, players prefer
game servers that exhibit low RTT. For example, in a typical FPS ‘death match’
game players win by having the highest total kills (or ‘frags’) after a certain time
period (typically tens of minutes), or by being the first to reach a particular number
of frags. Figure 17.1 (from [4, 9]) shows the general impact of network latency on
frag rate for Quake III Arena (Q3A) players in 2001. Over a period of minutes the
cumulative benefit of being 50ms rather than 150ms from the game server can be
significant.10
Fig. 17.1 Success (measured in ‘frags’ per minute) of Quake III Arena players in 2001 as a func-
tion of the player’s median RTT from a server [4]
10 In Figure 17.1 identical Q3A servers were based in California and London over a period of
months. The set of players on each server only partially overlapped, yet a similar trend is evident
in each case.
Published literature suggests that competitive online FPS game play requires
latencies below 150–200 ms [10, 35]. Sports and strategy games have also demon-
strated some latency intolerance, but less so than FPS overall as the game play is
not always frenetic [13, 27, 41].
Given an intolerance for RTT over roughly 150–200 ms, it is easy to see why
online gamers abhor RTT fluctuations induced by transient queuing delays along
their network paths. Service providers who can stabilize the RTT experienced by
game players will likely gain an appreciative market.
11 Game servers track RTT to each client so that they know how much roll back is required for
events involving different players.
As noted in Section 17.2.1, server discovery is the process by which FPS game
clients locate up-to-date information about active game servers so that a player can
select a suitable server on which to play. Server discovery can trigger megabytes of
network traffic per client and take multiple minutes to complete. Network devices
that keep per-flow state (such as NAT-enabled home routers) can experience bursts
of thousands of dynamically created state entries, tying up memory for minutes
simply to enable a sub-second packet exchange. Finding servers with low RTT is
a common goal, and optimizing this process is an active area of research [5, 6, 11,
24, 25, 35]. In this section, we will briefly review the server discovery techniques
of Valve Corporation’s Counterstrike:Source (CS:S) [47], then summarize a recent
idea for reducing the number of probe flows generated during FPS server discovery.
12 Valve has no specific names for the messages between master server and client, so getservers
and getserversResponse have been chosen for clarity. UDP/IP packet formats for server discovery
are available online [48].
Send one A2S INFO Request probe packet to each game server in order, eliciting
an A2S INFO Reply packet from every active game server.
Repeat the previous steps until the Steam master server has no more game server
details to return.
Each game server’s RTT is estimated from the time between the client sending
an A2S INFO Request and receiving the A2S INFO Reply. Server-specific informa-
tion in each A2S INFO Reply is used to update the Steam client’s on-screen server
browser, enabling players to select their preferred server for game play.
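The RTT estimation step amounts to timing a single request/reply exchange. The sketch below illustrates the idea only; the probe payload shown is a placeholder rather than the real A2S INFO wire format, which is documented by Valve [48].

```python
# Sketch: estimating a game server's RTT from a single timed UDP probe
# exchange. The payload here is a placeholder; a real client would send the
# A2S INFO request documented by Valve [48].
import socket
import time

def probe_rtt(host: str, port: int, payload: bytes = b"probe", timeout: float = 1.0):
    """Return the RTT (in seconds) to (host, port), or None on timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        start = time.monotonic()
        sock.sendto(payload, (host, port))
        sock.recvfrom(2048)              # wait for the reply packet
        return time.monotonic() - start
    except socket.timeout:
        return None
    finally:
        sock.close()

# Example (hypothetical address): rtt = probe_rtt("203.0.113.10", 27015)
```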
Third-party server browsers may use a different sequence. Qstat [1] (an open-
source server browser) retrieves all registered servers first (using back-to-back
getservers queries) before issuing A2S INFO Request probes. It achieves a sim-
ilar result – information (including RTT) is collected about all active servers and
presented for player selection.
In late 2007, the master server would return between 28 and 31 K servers, of
which roughly 27 K responded to probes [6]. A single server discovery sequence
would result in roughly 5 Mbytes of traffic,13 and show quite different RTT distri-
butions depending on the client’s location.
Figures 17.3 and 17.4 (from [6]) show the game server RTTs versus time mea-
sured by German and Taiwanese clients during server discovery. At a nominal rate
Fig. 17.3 Counterstrike:Source game server RTTs vs time as seen from Germany in Nov’07
13 Outbound A2S INFO Requests (53-byte UDP/IP packets) account for about 30%, with the
remainder (inbound) made up of variable-length A2S INFO Replies.
Fig. 17.4 Counterstrike:Source game server RTTs vs time as seen from Taiwan in Nov’07
14 Approximating a Steam client configured for ‘DSL > 256K’ network access.
15 This also reduces the level of transient per-flow state created in network devices such as NAT-
enabled home routers (which often retain new UDP flow mapping state for minutes after each
sub-second A2S INFO Request/Reply transaction).
[Plot: CDF (%) of RTT (seconds) to all responding game servers, for client locations AU, JP, DE, TW, UK, and USA]
Fig. 17.5 Distribution of RTTs measured to all Counterstrike:Source game servers from six client
locations
The challenge of finding FPS servers with low enough RTT is well recognized [25].
Early research focused on relocating clients to optimally placed servers (e.g., [24]),
rather than optimizing the server discovery process itself. Ideally, we would like to
reduce the tens of thousands of short-lived UDP probe flows generated by current
server discovery. This can be achieved by probing closer servers before more
distant ones, and automatically terminating the search sequence once it no
longer seems likely the client will find any more game servers with ‘playable’ RTT.
As noted in [5, 6] the problem appears contradictory: we wish to probe game
servers in order of ascending RTT before we have probed them to establish their
RTT. A master server cannot pre-sort the list of game servers in order of ascending
RTT because it cannot know the network conditions existing between any given
client and every game server.16 Every client must emit active probes to establish
their RTT to individual game servers.
There are three key steps to optimized server discovery: clustering, calibration, and
optimized probing. In 2006, the author hypothesized that a client might locally re-
order the probe sequence so that game servers in countries ‘closer’ to the client
would be probed before those ‘further away’ [11]. First, game servers returned by
the master server would be clustered in such a way that members of a cluster were
likely to share similar RTT from the client. Calibration involves probing a subset of
servers in each cluster, providing an estimate of the RTT to each cluster relative to
the client’s current location. Finally, the clusters would be ranked in ascending order
of estimated RTT, and all remaining game servers probed in order of their cluster’s
rank (optimized probing).
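A minimal sketch of these three phases follows. The cluster_key() function (e.g., an origin-AS lookup) and the probe_rtt() helper are assumed single-argument inputs supplied by the caller, and the calibration sample size and auto-stop threshold are illustrative parameters rather than values taken from [6].

```python
# Sketch: cluster / calibrate / reorder server discovery, following the idea
# in [5, 6, 11]. cluster_key() and probe_rtt() are assumed helpers; the
# sample size and RTT threshold are illustrative parameters.
import random
from collections import defaultdict
from statistics import median

def optimized_discovery(servers, cluster_key, probe_rtt,
                        samples_per_cluster=3, rtt_threshold=0.200):
    # 1. Clustering: group servers that are likely to share a similar RTT.
    clusters = defaultdict(list)
    for s in servers:
        clusters[cluster_key(s)].append(s)

    # 2. Calibration: probe a few servers per cluster to estimate cluster RTT.
    estimates = {}
    for key, members in clusters.items():
        sample = random.sample(members, min(samples_per_cluster, len(members)))
        rtts = [r for r in (probe_rtt(s) for s in sample) if r is not None]
        estimates[key] = median(rtts) if rtts else float("inf")

    # 3. Optimized probing: probe cluster by cluster in ascending order of
    #    estimated RTT, stopping once remaining clusters look 'unplayable'.
    results = {}
    for key in sorted(clusters, key=lambda k: estimates[k]):
        if estimates[key] > rtt_threshold:
            break                        # auto-stop: remaining clusters too far
        for s in clusters[key]:
            results[s] = probe_rtt(s)
    return results
```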
Clustering by country codes17 and high-order bits of the server IP addresses was
proposed in [11] and explored in [5]. Unfortunately, country codes are a very coarse
indicator of topological locality and the estimated RTT for particular countries often
bore little relation to the spread of RTTs of individual servers falling under the same
country code. An alternative approach clustered game servers by the Autonomous
System18 (AS) to which each game server’s IP address belongs (its origin AS) [6].
Game servers sharing a common origin AS are likely to share a similar distance
(and hence RTT) from any given client.19
During the optimized probing phase, successive RTT estimates tend to trend
upwards (although the trend will fluctuate due to variability of RTT to individual
game servers in each cluster). Nevertheless [6] illustrates that a client can imple-
ment early termination of server discovery when recent RTT estimates exceed some
player-specified threshold. Early termination reduces the number of probes emitted,
the number of UDP flows generated, and the player’s time waiting to know if all
playable servers have been probed. (The latter is particularly relevant for clients a
long way from many servers.)
16 A master server also cannot trust other clients to accurately report such information from dif-
ferent parts of the Internet. A few misbehaving players injecting corrupt RTT information could
easily disrupt such a system.
17 For example, using MaxMind’s free GeoLite Country database [36].
18 AS numbers are used in inter-domain routing (by the Border Gateway Protocol, BGP [44]) to
identify topologically distinct regions of the Internet.
19 Reference [6] further refines the definition of a cluster to catch origin ASes covering a wide
geographical area (and hence a potentially wide spread of RTTs). However, the details are beyond
the scope of this chapter.
Figure 17.6 (from [6]) illustrates the impact on CS:S clients located in Taiwan. For
each client the optimized probe sequence generates traffic in two phases – ‘cali-
bration probes’ and ‘reordered probes.’ It takes roughly 23 s (approximately 3,200
probes at 140/s) to calibrate before issuing reordered probes. Early termination oc-
curs when the ‘auto-stop estimator’ exceeds the player RTT tolerance (set to 200 ms
for this example), and the estimated auto-stop time is indicated by a dashed vertical
line. Reordered probes are plotted beyond auto-stop to illustrate the effectiveness of
reordering.
Taiwanese clients see a distinct improvement – auto-stop causes the probing to
terminate well before the roughly 230 s period taken by regular server discovery.
A similar improvement is seen in [6] for Australian and Japanese clients too.
Because they are close to large concentrations of game servers under 200 ms,
clients based in the UK (Figure 17.7 from [6]) see a fairly neutral impact. Some
servers over 200 ms are seen toward the end of the probe sequence, but even with
auto-stop the clients ultimately probe almost all active servers. Similar results
are seen in [6] for clients based in the USA and Germany.
With an optimized search order and auto-stop, clients far away from most game
servers see a significant reduction in the number of probes they must emit
before concluding that all playable servers have been seen. This reduces the player’s
wait time, the number of transient UDP flows in the network and the number of bytes
sent or received.
[6] proposes that integration of an open-source tool (such as Quagga [2]) into the
Steam master server could allow mapping of game server IP addresses to origin
AS numbers to occur at the master server.20 Currently, the master server returns an
unordered list S1, S2, ..., SN where each Sx is a 6-byte ⟨IP addr : port⟩ pair. AS numbers
may be embedded by instead returning a list of the form AS1, S1,1, S1,2, ..., S1,N,
AS2, S2,1, S2,2, ..., S2,N, and so on. Each ASx indicates the origin AS to which
the following game servers belong. The ASx is encoded as a 6-byte Sx field with the
‘port’ field set to zero (an otherwise invalid port for a genuine game server) and the
nonzero AS number encoded in the 4-byte ‘IP addr’ field. (New AS numbers take
4 bytes. Traditional 2-byte AS numbers would be encoded as 4-byte numbers with
the top 2 bytes zeroed.)
Embedding AS numbers will increase the typical master server reply list by
roughly 4% (roughly 1,200 additional 6-byte ASx markers, or 7 Kbytes). However,
there is a net gain to the client if auto-stop eliminates roughly 55 or more A2S INFO
Reply packets (which are often over 135 bytes long). For many clients auto-stop will
eliminate hundreds or thousands of A2S INFO Request/Reply probes.
20 In principle clustering might also be performed at the client, but this would require all clients
have access to up-to-date BGP routing information (impractical) and create additional network
traffic for limited return.
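The following sketch shows how a client might separate AS markers from game servers when parsing such a reply. The byte ordering and framing here are assumptions made for illustration; the real wire format is defined by Valve’s master server protocol [48].

```python
# Sketch: parsing a master-server reply in which AS markers are embedded as
# 6-byte entries with port == 0 and the AS number carried in the address
# bytes. Byte ordering and framing are assumptions for illustration.
import struct
from collections import defaultdict

def parse_reply_with_as_markers(data: bytes):
    """Return {AS number: [(ip, port), ...]} from a flat list of 6-byte entries."""
    servers_by_as = defaultdict(list)
    current_as = None
    usable = len(data) - len(data) % 6          # ignore any trailing partial entry
    for offset in range(0, usable, 6):
        addr = data[offset:offset + 4]
        port = struct.unpack("!H", data[offset + 4:offset + 6])[0]
        if port == 0:
            # An AS marker: the 4 'address' bytes carry the origin AS number.
            current_as = struct.unpack("!I", addr)[0]
        else:
            ip = ".".join(str(b) for b in addr)
            servers_by_as[current_as].append((ip, port))
    return servers_by_as
```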
Retrieving all game servers (and ASx markers) from the master server before
initiating calibration and reordered probing would be a change for Steam clients,
but represents no change to server browsers such as Qstat.
There is still room for improvement in [6], particularly in the trade-off between
the number (and distribution) of probes sent during calibration and the accuracy of
subsequent cluster rankings. Nevertheless, this technique may be applied to other
FPS games that use a similar server discovery process.
Leaving aside certain innovative Asian markets, most game players today utilize
consumer ‘broadband’ access links with upstream rates between 128 Kbps and
1+ Mbps. This places a practical upper bound on the speed with which server dis-
covery probes may be sent. Another issue is the mismatch between home LAN
speeds and the access link. Game clients must shape their probe traffic to minimize
inflation of subsequent RTT estimates.
For example, Valve’s Steam client’s bursty emission of A2S INFO Request
probes has been shown to inflate RTT estimates even when the average rate would
appear acceptable [7]. CS:S players may influence their A2S INFO Requests emis-
sion rate by configuring their Steam client to assume a network connection of
“Modem-56 K”, “DSL > 256K”, etc. In [7] a Steam client in Australia was con-
nected to the Internet in early 2008 via a 100 Mbps LAN and ADSL2+ link.21
Figures 17.8 and 17.9 (taken from [7]) illustrate the different spread of RTTs expe-
rienced when the Steam client was configured for “modem-56K” (measured at 35
probes/second) and “DSL/Cable > 2M” (measured at 319 probes/s) modes, respec-
tively. Relative to Figure 17.8, Figure 17.9 shows a noticeable ‘smearing upwards’
of estimated RTTs belonging to servers in different regions of the planet.
Closer inspection of the Steam client’s traffic revealed that probes were being
sent in bursts at LAN line-rate, causing additional queuing delays along the out-
bound link. Figure 17.10 compares (as cumulative distribution of measured RTTs)
the Steam client’s results to those obtained using Qstat 2.11 with default settings
(measured as 54 probes/s).22 Steam’s burstiness when configured for “DSL/Cable
> 2M” has a clear impact. Interestingly, when configured for “modem - 56K” the
21 Steam client built: Jan 9 2008, at 15:08:59, Steam API: v007, Steam package versions: 41/457.
ADSL2+ link synchronized at 835 Kbps up and 10,866 Kbps down.
22 Figure 17.10 reveals that a small number of servers are within Australia or Asia (roughly 20 ms to
140 ms), a moderate number of servers are in North America (the 200 ms+ range), and a far larger
community of game servers are in Europe (330 ms+). The need to jump oceans to reach Asia,
North America, and then Europe leads to the distinctly non-uniform distribution of RTTs.
Fig. 17.8 Estimated RTT versus time – Steam client at low speed (“modem-56 K”) setting
Fig. 17.9 Estimated RTT versus time – Steam client at high speed (“DSL/Cable > 2M”) setting
Steam client’s burstiness still caused a slight inflation of RTT estimates relative to
Qstat (despite Qstat probing faster on average).23
23 When running in “DSL/Cable > 2M” mode, over 90% of the probes emitted by the Steam client
are less than 1ms apart. In “modem - 56K” mode the Steam client emits 70% of its probe packets
less than 1ms apart. Qstat makes a modest attempt to ‘pace’ the transmission of probe packets.
[Fig. 17.10: cumulative distribution (CDF, %) of measured RTTs (seconds) – Steam client probing versus Qstat probing]
Traffic between an online game client and remote servers will encounter packet
queues in many locations – such as routers, switches, bridges, and modems. A good
FPS client should attempt to smoothly pace out the transmission of probe packets,
in addition to ensuring that the average probe packet rate does not exceed the player’s
available upstream network capacity.
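A minimal sketch of such pacing is shown below; the target probe rate is an assumed parameter that a client would derive from its configured access speed.

```python
# Sketch: pacing probe transmissions so they are spread evenly over time
# rather than emitted in line-rate bursts. The target rate is an assumed
# parameter derived from the player's configured access speed.
import time

def send_paced(probes, send_one, probes_per_second: float):
    """Call send_one(probe) for each probe, spacing calls roughly evenly."""
    interval = 1.0 / probes_per_second
    next_send = time.monotonic()
    for probe in probes:
        now = time.monotonic()
        if now < next_send:
            time.sleep(next_send - now)   # wait for the next transmit slot
        send_one(probe)
        next_send += interval             # fixed schedule avoids cumulative drift
```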
Traffic patterns are of interest when investigating the sharing of network infras-
tructure between online games and more traditional, non-interactive services. For
trend analysis and capacity planning, aggregate patterns over long timescales (hours,
days, and weeks) are of interest. Short-timescale traffic characteristics (such as geo-
graphic/topological diversity of participating network endpoints, inter-packet arrival
times, and packet size distributions) are important when evaluating IP quality of ser-
vice (QoS) implications during game play.
Whether long or short timescale, traffic is easily monitored using conventional
packet capture tools (such as tcpdump) and post-analysis scripts. Predicting traffic
from direct inspection of an FPS game’s source code cannot replace actual measure-
ments. First, most games do not release their source code. Second, the common use
of delta compression (to keep packets small) means the actual distribution of traffic
depends on how players actually tend to interact in each virtual world.
The next step is to create synthetic models of typical game traffic so, as network
engineers, we may simulate the impact of game traffic under circumstances different
to those we can measure empirically [14–19, 23, 29–31, 33, 34].
24 Two ET servers were observed in [52]. Game play accounted for 116 GB of traffic on one
server and 14GB on the other over 20 weeks. Yet over the same period both servers were equally
impacted by server discovery, experiencing 19 million short-lived flows totalling 8GB of traffic
each.
[Plot: server discovery probe flows per hour vs hour of day (Australian East Coast), broken out by Australian, North American, and Asian sources]
Fig. 17.11 Density of server discovery traffic over time experienced by a Wolfenstein Enemy
Territory server in 2005
[Plot: game-play flows per hour vs hour of day (Australian East Coast), broken out by source region]
Fig. 17.12 Density of game-play traffic over time experienced by a Wolfenstein Enemy Territory
server in 2005
Game play is the period when interactions between game traffic and non-game traf-
fic are most important. During game play, and over short timescales, most FPS
games exhibit limited variation in packet size from client-to-server and larger vari-
ation from server to client.
Figures 17.13 (from [33]) and 17.14 (from [51]) illustrate the client-to-server
packet size distributions of Half Life and Halo 2, respectively (both during death-
match games). Many FPS games tend to send client-to-server updates varying from
10 to 50 ms apart, so there is not too much to report in any given update message. In
Half Life there is only one player per client, and Figure 17.13 illustrates that packet
sizes are not influenced by particular choice of map. The Halo 2 client is Microsoft’s
XBox game console, which supports up to four players and leads to Figure 17.14’s
four different packet distributions.25
Traffic in the server-to-client direction is typically emitted as regular bursts of
back-to-back snapshot packets, one to each of the clients attached to the game server
at any given time.26 The precise interval between snapshots depends on the FPS
game itself and local configuration, but typically ranges from 15 to 50 ms (Chapter
10, [10]). Figures 17.15 (from [33]) and 17.16 (from [51]) use Half Life and Halo 2,
respectively (both during deathmatch games) to illustrate the influence on snapshot
sizes of the number of players and type of map (as they reflect all changes in game
Fig. 17.13 Client-to-server packet sizes for Half Life (original) with three different game maps
25 Figure 17.14 also reflects the fact that Halo 2’s client update messages increment in 8-byte jumps.
26 Snapshots are sent in bursts to minimize any unfairness between clients receiving their individual
snapshots at different times.
Fig. 17.14 Client-to-server packet sizes for Halo 2 (Xbox) with varying numbers of players on
one client (console) and the same map
Fig. 17.15 Server-to-client packet sizes for Half Life (original) with three different game maps
Fig. 17.16 Server-to-client packet sizes for Halo 2 (Xbox) with varying total numbers of players
and the same map
state each client needs to see).27 For a given number of players the type of map
can influence how often individual players see or interact with each other – longer
snapshots occur when more player interactions occur per unit time (Figure 17.15).
Similarly, for a given map we see more player interactions per unit time as the total
number of players increases (Figure 17.16).
Snapshot burstiness drives up the bandwidth required of network links close to
a game server. For example, an FPS game server serving 30 clients at 20 snapshots
per second will emit 30 snapshots back-to-back once every 50 ms. A game server
link provisioned for a smooth flow of 600 snapshots per second will become con-
gested every 50 ms, queuing (and delaying) most snapshots in the burst. This creates
a slight bias against the client whose snapshot is transmitted last. Network links
carrying traffic away from a server must be overprovisioned to minimize queuing
delays during each burst of snapshots.
27 Most FPS games minimize bandwidth requirements by using delta compression to avoid sending
updates about entities whose state has not changed since the previous snapshot.
server-to-client traffic with more than ten (or so) players can be a practical chal-
lenge to set up and manage. However, it appears that reasonable approximations of
many-player FPS server-to-client traffic may be synthesized from traffic captured
with small numbers of players (such as two- and three-player games) [15–19].
Two assumptions underlie a simple synthesis technique [15]. First, assume that
the nature of game play for individual players does not significantly change regard-
less of the number of players. (Each player spends similar amounts of time involved
in exploring the map, collecting useful items, and engaging in battles regardless of
the number of players.) Second, assume that players have similar behavior. (They
may not be of similar ability but will engage in similar activities in much the same
way as each other.) Consequently, the random variable describing the packet length
of an N-player game can be constructed by adding together the random vari-
ables of games with fewer players. We can thus construct the server-to-client packet size
probability mass function (PMF)28 of an N-player game by suitable convolutions
of the (measured) PMFs of two- and three-player games.
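The convolution step itself is straightforward. The sketch below assumes pmf2 and pmf3 are arrays obtained from real two- and three-player traces (index = packet length in bytes, value = probability); those names and the example calls are placeholders, not data from [15].

```python
# Sketch: synthesizing an N-player server-to-client packet-size PMF by
# convolving measured small-game PMFs, in the spirit of [15]. pmf2 and pmf3
# are placeholders for PMFs measured from 2- and 3-player traces.
import numpy as np

def synthesize_pmf(pmfs):
    """Convolve a list of PMFs; the result models the sum of the variables."""
    result = np.array([1.0])
    for pmf in pmfs:
        result = np.convolve(result, pmf)
    return result / result.sum()          # renormalize against rounding error

# Example: a 5-player PMF from one 2-player and one 3-player PMF,
# and a 9-player PMF from three copies of the 3-player PMF.
# pmf5 = synthesize_pmf([pmf2, pmf3])
# pmf9 = synthesize_pmf([pmf3, pmf3, pmf3])
```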
Figures 17.17 and 17.18 (both from [15]) compare a synthesized PMF with em-
pirically measured PMF of server-to-client packet size for five- and nine-player Half
Life 2 Deathmatch games. The synthesized five-player PMF involved convolving a
two-player PMF and three-player PMF together. The synthesized nine-player PMF
involved convolving a three-player PMF three times. On the right of each PMF is a
Q-Q plot whose 45-degree diagonal indicates a reasonably good match between the
measured and synthesized distributions.
As N increases there is increased divergence between reality and the synthesized
PMFs. The scheme’s utility can be extended by starting with larger empirically
measured PMFs (e.g., synthesizing an 18-player PMF by convolving a measured
nine-player PMF with itself).
Fig. 17.17 Measured and synthesized server-to-client packet sizes for Half Life 2 Deathmatch for
five-player match
28 A probability density function (PDF) for variables, such as packet sizes, that take on discrete
values.
Fig. 17.18 Measured and synthesized server-to-client packet sizes for Half Life 2 Deathmatch for
nine-player match
[Plot: cumulative fraction vs interval (seconds) – measured inter-probe interval distribution versus a standard exponential distribution (median 0.326 seconds)]
Fig. 17.19 Distribution of intervals between incoming server discovery probes during busiest
hours (Wolfenstein Enemy Territory in April 2006)
A typical FPS game server experiences server discovery traffic as a 24-hour-per-
day background noise. Arrivals of Wolfenstein Enemy Territory server discovery
probes have been seen to be uncorrelated and exhibit exponentially distributed inter-
probe intervals during both busiest and least-busy hours of the 24-hour cycle [8].
Figure 17.19 (from [8]) shows how inter-probe intervals closely resemble an expo-
nential distribution (using the median inter-probe interval at any given time). The
measured distribution is based on arrivals during the busiest hour of each day in
April 2006. Figure 17.20 (also from [8]) shows the probe arrivals are uncorrelated
(using samples from a particular ‘busy hour’).
Fig. 17.20 Auto-correlation of intervals between incoming server discovery probes during busiest
hours (Wolfenstein Enemy Territory in April 2006)
The author has seen similar patterns for Counterstrike:Source server discovery
probes. As server discovery is triggered by independent human behavior across
the planet, it seems reasonable to expect uncorrelated and exponentially distributed
inter-probe intervals for many (if not all) FPS games.
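This observation translates directly into a simple generator of synthetic probe arrivals. The sketch below treats arrivals as a Poisson process; the median inter-probe interval is an assumed input that would be fitted from a measured trace (0.326 s in Figure 17.19).

```python
# Sketch: generating synthetic server-discovery probe arrival times as a
# Poisson process (independent, exponentially distributed inter-probe
# intervals), consistent with the observations in [8]. The median interval
# is an assumed input fitted from a measured trace.
import math
import random

def synthetic_probe_arrivals(median_interval_s: float, duration_s: float):
    """Yield probe arrival times (seconds) over [0, duration_s)."""
    mean_interval = median_interval_s / math.log(2)   # exponential: median = mean * ln 2
    t = 0.0
    while True:
        t += random.expovariate(1.0 / mean_interval)
        if t >= duration_s:
            return
        yield t

# Example: arrivals = list(synthetic_probe_arrivals(0.326, 3600))
```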
Service providers may choose to detect and/or track game traffic for reasons such as
network performance monitoring, auditing customer usage, automated network re-
configuration, and market research. However, isolating (or merely identifying) game
traffic flows amongst the hundreds of thousands of flows seen in a service provider’s
core network is non-trivial.
It would be challenging to track the many ‘well-known’ UDP or TCP port num-
bers associated with every existing and future online game, and existing games are
not always run on their ‘official’ ports [50]. Packets might be classified by inspect-
ing UDP or TCP payloads, looking for evidence of game protocols in use. However,
keeping track of every future permutation of online game protocols would be prob-
lematic at best. Furthermore, in some jurisdictions the legality of inspecting the
payloads of packets without appropriate authorization is in doubt.
Fig. 17.21 Distribution of two ML features during four different periods of time for Wolfenstein
Enemy Territory client-to-server traffic
29 Statistics such as the distribution of packet sizes or inter-packet arrival times are attractive
because they can be measured by observing ‘external’ attributes of packets (timestamps and length
fields), side-stepping any need to ‘look inside’ the packets.
[Scatter plots of the two ML features (packet length standard deviation vs mean packet length) for probing, connecting, in-game, and full-flow traffic]
Fig. 17.22 Distribution of two ML features during four different periods of time for Wolfenstein
Enemy Territory server-to-client traffic
classification (which would occur inside the ISP’s own network, on one or more
dedicated traffic-monitoring boxes) from instantiation of the priority game traffic
handling (typically on the uplink side of the consumer’s home router). The decou-
pling ensured that the low-power consumer devices would not be burdened with
execution of the ML algorithms, and allowed ISPs to provide a ‘value-added’ service
to owners of suitably ‘ANGEL-aware’ home routers or gateways. In early testing
the system would identify Wolfenstein Enemy Territory traffic with less than a sec-
ond of game-play traffic, and successfully reconfigure an otherwise-congested home
router. Within seconds of the game-play traffic ceasing, ANGEL would remove the
previously installed rules for prioritization of a particular UDP flow.
Statistical classification of game traffic is still in its early days. The likely scala-
bility and utility of this technique remain to be seen.
17.7 Conclusion
The game industry continues to evolve its techniques for extracting the most real-
istic ‘immersion’ experience for players given the vagaries of best-effort Internet
service. A key challenge for service providers is understanding the characteristics
of traffic imposed on networks by games, and their service quality requirements.
Interactive online games are particularly susceptible to the side effects of other
non-interactive (or delay- and loss-tolerant) traffic sharing next-generation access
links. This creates challenges out toward the edges, where high-speed home LANs
squeeze through broadband consumer access links to reach the Internet. We have
identified a range of research work exploring many issues associated with the inter-
section of highly interactive games and the Internet, and hopefully stimulated some
further thinking along these lines.
Of course, space does not allow this chapter to represent the full spectrum of
issues associated with development and deployment of multiplayer online games.
Interested readers will find numerous relevant papers published in a number of con-
ferences whose scopes have expanded to include multiplayer online games in recent
years. Two notable examples are Netgames (Workshop on Network and System
Support for Games, http://www.netgames-conf.org/), and ACM NOSSDAV (Net-
work and Operating System Support for Digital Audio and Video,
http://www.nossdav.org/). Most data-networking and multimedia systems conferences run by
the ACM or IEEE have begun attracting games-related papers, which can be found
in the ACM Digital Library (http://portal.acm.org) and IEEEXplore
(http://www.ieee.org/web/publications/xplore/). The ACM in particular has a number of Special
Interest Groups (SIGs) whose scopes overlap the area of networking and online
games. Of particular interest are SIGCHI (Computer Human Interaction,
http://www.sigchi.org/), SIGCOMM (Data Communications, http://www.sigcomm.org/),
and SIGGRAPH (Graphics and Interactive Techniques, http://www.siggraph.org).
References
13. Tom Beigbeder, Rory Coughlan, Corey Lusher, John Plunkett, Emmanuel Agu, and Mark Clay-
pool. The effects of loss and latency on user performance in Unreal Tournament 2003. In
NetGames ’04: Proceedings of 3rd ACM SIGCOMM workshop on Network and system sup-
port for games, pages 144–151, New York, NY, USA, 2004. ACM.
14. M. Borella. Source models of network game traffic. Computer Communications, 23(3):
403–410, February 2000.
15. P. Branch and G. Armitage. Extrapolating server to client IP traffic from empirical measure-
ments of first person shooter games. In 5th Workshop on Network System Support for Games
2006 (Netgames2006), October 2006.
16. P. Branch and G. Armitage. Measuring the auto-correlation of server to client traffic in first per-
son shooter games. In Australian Telecommunications, Network and Applications Conference
(ATNAC), December 2006.
17. P. Branch, G. Armitage, and T. Cricenti. Time-series Modelling of Server to Client IP Packet
Length in First Person Shooter Games. In Proceedings of 15th IEEE International Conference
on Networks (ICON), Adelaide, Australia, November 2007.
18. P. Branch and T. Cricenti. ARMA(1,1) Modeling of Quake4 Server to Client Game Traffic. In
6th Workshop on Network System Support for Games 2007 (Netgames2007), September 2007.
19. P. Branch, T. Cricenti, and G. Armitage. A Markov Model of Server to Client IP traffic in First
Person Shooter Games. In Proceedings of 2008 IEEE International Conference on Communi-
cations (ICC2008), Beijing, China, May 2008.
20. J. But, T.T.T. Nguyen, L. Stewart, N. Williams, and G. Armitage. Performance Analysis of the
ANGEL System for Automated Control of Game Traffic Prioritisation. In Proceedings of 6th
Annual Workshop on Network and Systems Support for Games (Netgames 2007), Melbourne,
Australia, September 2007.
21. J. But, N. Williams, S. Zander, L. Stewart, and G. Armitage. ANGEL - Automated Network
Games Enhancement Layer. In Proceedings of Netgames 2006, Singapore, October 2006.
22. Centre for Advanced Internet Architectures. ANGEL - Automated Network Games Enhance-
ment Layer. http://caia.swin.edu.au/sitcrc/angel/, as of July 30th 2008.
23. C. Chambers, W.-C. Feng, S. Sahu, and D. Saha. Measurement based characterization of a
collection of on-line games. In Internet Measurement Conference 2005 (IMC2005), October
2005.
24. C. Chambers, W.-C. Feng, W.-C. Feng, and D. Saha. A geographic, redirection service for
on-line games. In ACM Multimedia 2003 (short paper), November 2003.
25. M. Claypool. Network characteristics for server selection in online games. In ACM/SPIE
Multimedia Computing and Networking (MMCN), January 2008.
26. Mark Claypool. The effect of latency on user performance in real-time strategy games. Comput.
Netw., 49(1):52–70, 2005.
27. Mark Claypool and Kajal Claypool. Latency and player actions in online games. Commun.
ACM, 49(11):40–45, 2006.
28. M. Duke, R. Braden, W. Eddy, and E. Blanton. A Roadmap for Transmission Control Protocol
(TCP) Specification Documents. RFC 4614 (Informational), September 2006.
29. J. Farber. Traffic modelling for fast action network games. Multimedia Tools and Applications,
23:31–46, December 2004.
30. W.-C. Feng, F. Chang, W.-C. Feng, and J. Walpole. Provisioning on-line games: A traffic
analysis of a busy Counter-Strike server. In SIGCOMM Internet Measurement Workshop, 2002.
31. W.-C. Feng, F. Chang, W.-C. Feng, and J. Walpole. A traffic characterization of popular on-line
games. IEEE/ACM Transactions on Networking (TON), 13:488–500, June 2005.
32. T. Henderson and S. Bhatti. Modelling user behaviour in networked games. In 9th ACM
International Conference on Multimedia (ACM Multimedia), 2001.
33. T. Lang, G. Armitage, P. Branch, and H.-Y. Choo. A synthetic traffic model for Half-Life. In
Australian Telecommunications, Networks and Applications Conference (ATNAC), December
2003.
34. T. Lang, P. Branch, and G. Armitage. A synthetic model for Quake III traffic. In Advances in
Computer Entertainment (ACE2004), June 2004.
35. Youngki Lee, Sharad Agarwal, Chris Butcher, and Jitu Padhye. Measurement and Estimation of
Network QoS Among Peer Xbox 360 Game Players. In Proc. 9th Passive and Active Network
Measurement Conference (PAM 2008), pages 41–50. Springer Berlin/Heidelberg, April 2008.
36. MaxMind. GeoLite Country. http://www.maxmind.com/app/geoip_country, accessed February
8th 2008.
37. T.T.T. Nguyen and G. Armitage. Synthetic Sub-flow Pairs for Timely and Stable IP Traffic
Identification. In Proc. Australian Telecommunication Networks and Application Conference,
Melbourne, Australia, December 2006.
38. T.T.T. Nguyen and G. Armitage. Training on multiple sub-flows to optimise the use of Ma-
chine Learning classifiers in real-world IP networks. In Proc. IEEE 31st Conference on Local
Computer Networks, Tampa, Florida, USA, November 14–16 2006.
39. T.T.T. Nguyen and G. Armitage. A Survey of Techniques for Internet Traffic Classification
using Machine Learning. IEEE Communications Surveys & Tutorials, 10(4), October 2008.
40. T.T.T. Nguyen and G. Armitage. Clustering to Assist Supervised Machine Learning for Real-
Time IP Traffic Classification. In IEEE International Conference on Communications (ICC
2008), Beijing, China, May 2008.
41. James Nichols and Mark Claypool. The effects of latency on online madden NFL football.
In NOSSDAV 04: Proceedings of the 14th international workshop on Network and operating
systems support for digital audio and video, pages 146–151, New York, NY, USA, 2004. ACM
Press.
42. J. Postel. User Datagram Protocol. RFC 768 (Standard), August 1980.
43. J. Postel. Transmission Control Protocol. RFC 793 (Standard), September 1981. Updated by
RFC 3168.
44. Y. Rekhter, T. Li, and S. Hares. RFC 4271: A Border Gateway Protocol 4 (BGP-4), January
2006.
45. Nathan Sheldon, Eric Girard, Seth Borg, Mark Claypool, and Emmanuel Agu. The effect
of latency on user performance in Warcraft III. In NetGames ’03: Proceedings of the 2nd
workshop on Network and system support for games, pages 3–14, New York, NY, USA, 2003.
ACM.
46. Lawrence Stewart, Grenville Armitage, and Alana Huebner. Collateral Damage: The Impact of
Optimised TCP Variants On Real-time Traffic Latency in Consumer Broadband Environments.
In IFIP/TC6 NETWORKING 2009, Aachen, Germany, May 2009.
47. Valve Corporation. CounterStrike: Source. http://counter-strike.net/, accessed February 8th
2008.
48. Valve Corporation. Server Queries. http://developer.valvesoftware.com/wiki/Server_Queries,
as of February 7th 2008.
49. Nigel Williams, Sebastian Zander, and Grenville Armitage. A preliminary performance
comparison of five machine learning algorithms for practical IP traffic flow classification.
SIGCOMM Comput. Commun. Rev., 36(5):5–16, 2006.
50. S. Zander. Misclassification of game traffic based on port numbers: A case study using enemy
territory. CAIA Technical Report 060410D, April 2006.
51. S. Zander and G. Armitage. A traffic model for the XBOX game Halo 2. In 15th ACM
International Workshop on Network and Operating System Support for Digital Audio and Video
(NOSSDAV2005), June 2005.
52. S. Zander, D. Kennedy, and G. Armitage. Dissecting server-discovery traffic patterns gener-
ated by multiplayer first person shooter games. In Proceedings of ACM Networks and System
Support for Games (NetGames) Workshop, New York, USA, October 2005.
53. Sebastian Zander, Ian Leeder, and Grenville Armitage. Achieving fairness in multiplayer net-
work games through automated latency balancing. In ACE ’05: Proceedings of the 2005 ACM
SIGCHI International Conference on Advances in computer entertainment technology, pages
117–124, New York, NY, USA, 2005. ACM.
Chapter 18
Wayfinding in Social Networks
David Liben-Nowell
18.1 Introduction
D. Liben-Nowell
Department of Computer Science, Carleton College, Northfield, MN 55057
e-mail: [email protected]
Although social networks have been implicit in the interactions of humans for mil-
lennia, and social interactions among humans have been studied by social scientists
for centuries, the academic study of social networks qua networks is more recent.
Some of the early foundational contributions date from the beginning of the twenti-
eth century, including the “web of group affiliations” of Georg Simmel [45], the
“sociograms” of Jacob Moreno [40], and the “topological psychology” of Kurt
Lewin [31]. In the 1950s, Cartwright and Zander [9] and Harary and Norman
[20] described an explicitly graph-theoretic framework for social networks: nodes
neighbors of t) genuinely know their graph distance to t. Instead one must use some
guide other than graph distance to home in on t.
The key idea in routing in this context – frequently cited by the participants in
real small-world experiments as their routing strategy [13, 21] – is to use similarity
of characteristics (geographic location, hobbies, occupation, age, etc.) as a measure
of progress, a proxy for the ideal but unattainable graph-distance measure of prox-
imity. The success of this routing strategy hinges on the sociological observation
of the crucial tendency toward homophily in human relationships: the friends of a
typical person x tend to be similar to x. This similarity tends to occur with respect
to race, occupation, socioeconomics, and geography, among other dimensions. See
the survey of McPherson, Smith-Lovin, and Cook [36] for an excellent review of
the literature on homophily. Homophily makes characteristic-based routing reason-
able, and in fact it also gives one explanation for the high clustering of real social
networks: if x’s friends tend to be similar to x, then they also tend to be (somewhat
less) similar to each other, and therefore they also tend to know each other directly
with a (somewhat) higher probability than a random pair of people.
Homophily suggests a natural greedy algorithm for routing in social networks.
If a person s is trying to construct a path to a target t, then s should look at all of
his or her friends Γ(s) and, of them, select the friend in Γ(s) who is “most like”
the target t. This notion is straightforward when it comes to geography: the source s
knows both where his or her friends live and where t lives, and thus s can just compute
the geographic distance between each u ∈ Γ(s) and t, choosing the u minimizing
that quantity. Routing greedily with respect to occupation is somewhat murkier,
though one can imagine s choosing u based on distance within an implicit hierarchy
of occupations in s’s head. (Milgram’s stockbroker presumably falls into something
like the service industry → financial services → investment → stocks.) Indeed, the
greedy algorithm is well founded as long as an individual has sufficient knowledge
of underlying person-to-person similarities to compare the distances between each
of his or her friends and the target.
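A sketch of the geographic version of this greedy step is shown below; the friend list (playing the role of Γ(s)) and the table of declared hometown coordinates are assumed inputs for illustration, not part of any particular dataset.

```python
# Sketch: one step of geographic greedy routing. friends_of(s) plays the
# role of Gamma(s); locations maps each person to (latitude, longitude).
# Both are assumed inputs for illustration.
import math

def geo_distance(a, b):
    """Great-circle distance (km) between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def greedy_step(s, t, friends_of, locations):
    """Return the friend of s who lives closest to the target t."""
    return min(friends_of(s), key=lambda u: geo_distance(locations[u], locations[t]))
```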
Although homophily is a key motivation for greedy routing, homophily alone does
not suffice to ensure that the greedy algorithm will find short paths through a so-
cial network. As a concrete example, suppose that every sociologist studying social
networks knows every other such sociologist and nobody else, and every computer
scientist studying social networks knows every other such computer scientist and
nobody else. This network has an extremely high degree of homophily. But the net-
work is not even connected, let alone navigable by the greedy algorithm. For the
greedy algorithm to succeed, the probability of friendship between people u and v
should somehow vary more smoothly as the similarity of u and v decreases. Intu-
itively, there is a tension between having “well-scattered” friends to reach faraway
targets and having “well-localized” friends to home in on nearby targets. Without
the former, a large number of steps will be required to span the large gap from
a source s to an especially dissimilar target t; without the latter, similarity will be
only vaguely related to graph-distance proximity, and thus the greedy algorithm will
be a poor approximation to a globally aware shortest-path algorithm.
A rigorous form of this observation was made by Jon Kleinberg [23,26], through
formal analysis of this trade-off in an elegant model of social networks. Here is
Kleinberg’s model, in its simplest form (see Section 18.6 for generalizations). Con-
sider an n-person population, and arrange these people as the points in a regular
k-dimensional grid. Each person u in the network is connected to 2k “local neigh-
bors,” the people who live one grid point above and below u in each of the k cardinal
directions. (People on the edges of the grid will have fewer local neighbors, or we
can treat the grid as a torus without substantively affecting the results.) Each per-
son u will also be endowed with a “long-range link” to one other person v in the
network. That person v will be chosen probabilistically with Pr[u → v] ∝ d(u, v)^(-α),
where d(·, ·) denotes Manhattan distance in the grid and α ≥ 0 is a parameter to the
model. (Changing the model to endow each person with any constant number of
long-range links does not qualitatively change the results.) See Figure 18.1 for an
example network, with k = α = 2. Notice that the parameter α operationalizes
the trade-off between highly localized friends and highly scattered friends: setting
α = 0 yields links from each person u to a person v chosen uniformly from the
network, while letting α → ∞ yields links from u to v only if d(u, v) = 1.
Fig. 18.1 Kleinberg’s small-world model [23, 26]. A population of n people is arranged on a k-
dimensional grid, and each person u is connected to her immediate neighbors in each direction.
Each person u is also connected to a long-range friend v, chosen with probability ∝ d(u, v)^(-α),
where d(·, ·) denotes Manhattan distance and α ≥ 0 is a parameter to the model. The example
two-dimensional network here was generated with α = 2
A local-information algorithm is one that computes a path to a target without
global knowledge of the graph. When a person u chooses a next step v in the path
to the target t, the person u has knowledge of the structure of the grid, including
the grid locations of u herself, u’s local neighbors, u’s long-range contact, and the
target t. However, the remaining structure of the graph – that is, the long-range links
for nodes other than u – are not available to u when she is making her routing choice.
(The results are not affected by expanding the knowledge of each node u to include
the list of all people previously on the path from the original source s to u, or even
the list of long-range links for each of those people.)
Kleinberg was able to give a complete characterization of the navigability of
these networks by local-information algorithms:
Theorem 18.1 (Kleinberg [23, 26]). Consider an n-person network with people
arranged in a k-dimensional grid, where each person has 2k local neighbors and
one long-range link chosen with parameter α ≥ 0, so that Pr[u → v] ∝ d(u, v)^(-α).
For an arbitrary source person s and an arbitrary target person t:
If α ≠ k, then there exists some constant ε > 0, where ε depends on α and k but
is independent of n, such that the expected length of the path from s to t found by
any local-information algorithm is Ω(n^ε).
If α = k, then the greedy algorithm – i.e., the algorithm that chooses the next
step in the path as the contact closest to the target t under Manhattan distance
in the grid – finds a path from s to t of expected length O(log^2 n).
The proof that greedy routing finds a path of length O(log^2 n) when α = k proceeds
by showing that the probability of halving the distance to the target at any step of the
path is Ω(1/log n). Thus, in expectation, the distance to the target is halved every
O(log n) steps. The path reaches the target after the distance is halved log n times,
and therefore O(log^2 n) total steps suffice to reach the target in expectation.
For our purposes, we will broadly treat paths of length polynomial in the log-
arithm of the population size as “short,” and paths of length polynomial in the
population size as “long.” (We will use standard terminology in referring to these
“short” paths as having polylogarithmic length – that is, length O(log^c n) for some
constant exponent c, in a population of size n.) There has been significant work
devoted to tightening the analysis of greedy routing in Kleinberg’s networks – for
example, [7, 35] – but for now we will focus on the existence of algorithms that
find paths of polylogarithmic length, without too much concern about the precise
exponent of the polynomial. A network in which a local-information algorithm can
find a path of polylogarithmic length is called navigable. Theorem 18.1, then, can
be rephrased as follows: a k-dimensional grid-based social network with parameter α is navigable if and only if k = α.
Note that this definition of navigability, and hence Kleinberg’s result, describes
routing performance asymptotically in the population size n. Real networks, of
course, are finite. Aaron Clauset and Cristopher Moore [11] have shown via sim-
ulation that in finite networks greedy routing performs well even under what
Theorem 18.1 identifies as “non-navigable” values of α. Following [11], define α_opt as the value of α that produces the network under which greedy routing achieves the shortest path lengths. Clauset and Moore’s simulations show that α_opt is somewhat less than k in large but finite networks; furthermore, although α_opt approaches k as the population grows large, this convergence is relatively slow.
Fig. 18.2 The LiveJournal social network [32]. A dot is shown for each geographic location that
was declared as the hometown of at least one of the ≈500,000 LiveJournal users whom we were
able to locate at a longitude and latitude in the continental USA. A random 0.1% of the friendships
in the network are overlaid on these locations
Because the resolution of geographic locations is limited to the level of towns and cities,
we try only to reach the city of the target t rather than t herself. We found that,
subject to these caveats, the geographic greedy algorithm was able to find short
paths connecting many pairs of people in the network (see [32] for more detail).
With the above observations (people are arranged on a two-dimensional geo-
graphic grid; greedy routing based on geography finds short paths through the
network) and Theorem 18.1, we set out – in retrospect, deeply naïvely – to verify that the probability of friendship between people u and v decays asymptotically as d(u, v)^(−2) in the LiveJournal network. In other words, in the language of Kleinberg’s theorem, we wanted to confirm that α = 2. The results are shown in Figure 18.3, which displays the probability P(d) of friendship between two people who live a given distance d apart – i.e., the fraction of pairs separated by distance d who declare a friendship in LiveJournal.
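Procedurally, the quantity plotted in Figure 18.3 can be estimated as in the sketch below. This is my own illustration, with an assumed data format, the 10-km bucket width mentioned in the caption, and planar Euclidean distance standing in for true geographic distance:

from collections import defaultdict
from itertools import combinations

def link_probability_by_distance(locations, friendships, bucket_km=10):
    # Estimate P(d): the fraction of pairs at (bucketed) distance d who are friends.
    #   locations   -- dict: person -> (x_km, y_km) planar coordinates
    #   friendships -- set of frozensets {u, v} of declared friendships
    pairs = defaultdict(int)    # bucket -> number of pairs at that distance
    friends = defaultdict(int)  # bucket -> number of friendships at that distance
    for u, v in combinations(locations, 2):
        (x1, y1), (x2, y2) = locations[u], locations[v]
        d = ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
        bucket = int(d // bucket_km) * bucket_km
        pairs[bucket] += 1
        if frozenset((u, v)) in friendships:
            friends[bucket] += 1
    return {b: friends[b] / pairs[b] for b in sorted(pairs)}

For a network the size of LiveJournal, one would sample pairs rather than enumerate all of them, but the estimated quantity is the same.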
One immediate observation from the plot in Figure 18.3 is that the probability P(d) of friendship between two people in LiveJournal separated by distance d really does decrease smoothly and markedly as d increases. This relationship already reveals a nonobvious fact about LiveJournal: there was no particular reason to think
that geographic proximity would necessarily play an important role in friendships in
a completely virtual community like this one. Section 18.7 includes some discussion
of a few possible reasons why geography remains so crucial in this virtual setting,
but for now it is worth noting that the “virtualization” of real-world friendships
(i.e., the process of creating digital records of existing physical-world friendships)
seems to explain only some of the role of geography. For example, it seems hard for
[Plot for Figure 18.3: link probability (from 10^−6 to 10^−3, log scale) versus separating distance (from 10^1 to 10^3 km, log scale).]
Fig. 18.3 The probability P(d) of a friendship between two people in LiveJournal as a function of the geographic distance d between their declared hometowns [32]. Distances are rounded into 10-km buckets. The solid line corresponds to P(d) ∝ 1/d. Note that Theorem 18.1 requires P(d) ∝ 1/d² for a network of people arranged in a regular two-dimensional grid to be navigable
this process to fully account for the marked difference in link probability between
people separated by 300 versus 500 km, a range at which regular physical-world
interactions seem unlikely.
A second striking observation from the plot in Figure 18.3 is that the probability P(d) of friendship between people separated by distance d is very poorly modeled by P(d) ∝ 1/d², the relationship required by Theorem 18.1. This probability is better modeled as P(d) ∝ 1/d, and in fact is even better modeled as P(d) = ε + Θ(1/d), for a constant ε ≈ 5.0 × 10^−6. Apropos the discussion in the
previous paragraph, this additive constant makes some sense: the probability that
people u and v are friends can be thought of as the sum of two probabilities, one that
increases with their geographic proximity, and one that is independent of their ge-
ographic locations. But, regardless of the presence or absence of the additive ε, the
plot in Figure 18.3 does not match – or even come close to matching – the navigable
exponent required by Kleinberg’s theorem.
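One simple way to check such a fit numerically is an ordinary least-squares solve, since the model P(d) = ε + c/d is linear in the two unknowns; the sketch below is my own, and the bucketed estimates in the example are hypothetical:

import numpy as np

def fit_epsilon_plus_c_over_d(distances, probabilities):
    # Least-squares fit of P(d) = eps + c / d to bucketed estimates.
    # The model is linear in the unknowns (eps, c), so one lstsq solve suffices.
    d = np.asarray(distances, dtype=float)
    p = np.asarray(probabilities, dtype=float)
    design = np.column_stack([np.ones_like(d), 1.0 / d])  # columns: [1, 1/d]
    (eps, c), *_ = np.linalg.lstsq(design, p, rcond=None)
    return eps, c

# Hypothetical bucketed estimates: distances (km) and link probabilities.
eps, c = fit_epsilon_plus_c_over_d([10, 50, 100, 500, 1000],
                                   [1.2e-3, 2.5e-4, 1.3e-4, 3.0e-5, 2.0e-5])
print(eps, c)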
Similar results have also been observed in another social-networking context. In
a study of email-based social links among about 450 members of Hewlett–Packard
Research Labs [1], Lada Adamic and Eytan Adar found that the link probability
P(d) between two HP Labs researchers was also closely matched by P(d) ∝ 1/d,
where d measures the Manhattan distance between the cubicle locations of the em-
ployees. In this setting, too, geographic greedy routing found short paths to most
targets – though not as short as those found by routing greedily according to proxim-
ity in the organizational hierarchy of the corporation (see Sections 18.6 and 18.7) –
again yielding a greedily navigable network that does not match Theorem 18.1.
The observations from the previous section lead to a seeming puzzle: a navigable two-dimensional grid, which must have link probabilities decaying as 1/d² to be navigable according to Theorem 18.1, has link probabilities decaying as 1/d. But another look at Figure 18.2 reveals an explanation – and reveals the naïveté of looking for P(d) ∝ 1/d² in the LiveJournal network. Although a two-dimensional grid
is a reasonable model of geographic location, a uniformly distributed population on
a two-dimensional grid is a very poor model of the geographic distribution of the
LiveJournal population. Population density varies widely across the USA – from
over 10,000 people/km² in parts of Manhattan to approximately 1 person/km² in
places like Lake of the Woods County, in the far northern reaches of Minnesota.
Two Manhattanites who live 500 m apart have probably never even met; two Lake
of the Woods residents who live 500 m apart are probably next-door neighbors, and
thus they are almost certain to know each other. This wide spectrum suggests that
distance cannot be the whole story in any reasonable geographic model of social
networks: although Pr[u → v] should be a decreasing function of the geographic
distance between u and v, intuitively the rate of decrease in that probability should
reflect something about the population in the vicinity of these people.
One way to account for variable population density is rank-based friendship [6,
28, 32]. The grid-based model described here is the simplest version of rank-based
friendship; as with Kleinberg’s distance-based model, generalizations that do not
rely on the grid have been studied (see Section 18.6). We continue to measure
person-to-person distances using Manhattan distance in a k-dimensional grid, but
we will now allow an arbitrary positive number of people to live at each grid point.
Each person still has 2k local neighbors, one in each of the two directions in each
of the k dimensions, and one long-range link, chosen as follows. Define the rank
of a person v with respect to u as the number of people who live at least as close
to u as v does, breaking ties in some consistent way. (In other words, person u sorts
the population in descending order of proximity to u; the rank of v is her index in
this sorted list.) Now each person u chooses her long-range link according to rank,
so that Pr[u → v] is inversely proportional to the rank of v with respect to u. See
Figure 18.4 for an example rank-based network.
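A minimal sketch of this construction follows (my own illustration; ranks here are taken to be a person's position, starting from 1, in the population sorted by distance from u, with the stable sort supplying a consistent tie-break):

import random

def manhattan(p, q):
    # Manhattan (L1) distance between two grid points.
    return sum(abs(a - b) for a, b in zip(p, q))

def sample_rank_based_contact(u, population, location):
    # Sample u's long-range contact with Pr[u -> v] proportional to 1 / rank_u(v),
    # where rank_u(v) is v's position (starting at 1) in the population sorted by
    # distance from u.
    #   population -- list of people (u included)
    #   location   -- dict: person -> grid coordinates
    others = [v for v in population if v != u]
    others.sort(key=lambda v: manhattan(location[u], location[v]))
    weights = [1.0 / (index + 1) for index in range(len(others))]
    return random.choices(others, weights=weights, k=1)[0]

Note that only the ordering of the population by distance matters here, not the distances themselves, which is exactly what lets the construction adapt to nonuniform population density.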
Rank-based friendship generalizes the navigable α = k setting in the distance-based Theorem 18.1: in a k-dimensional grid with constant population at each point, the rank of v with respect to u is Θ(d(u, v)^k). But even under nonuniform population
densities, social networks generated according to rank-based friendship are naviga-
ble by the greedy algorithm:
Theorem 18.2 (Liben-Nowell, Novak, Kumar, Raghavan, Tomkins [28, 32]).
Consider an n-person network where people are arranged in a k-dimensional grid
so that at least one person lives at every grid point x. Suppose each person has 2k
local neighbors and one long-range link chosen via rank-based friendship. Fix any
source person s and choose a target person t uniformly at random from the popula-
tion. Then under greedy routing the expected length of the path from s to the point x_t in which t lives is O(log³ n).
(a) Concentric balls around a city C, where each ball’s population increases by a factor of four. A resident of C choosing a rank-based friend is four times more likely to choose a friend at the boundary of one ball than a friend at the boundary of the next-larger ball. (b) A social network generated from this population distribution via rank-based friendship. For visual simplicity, edges are depicted as connecting cities; the complete image would show each edge connecting one resident from each of its endpoint cities.
Fig. 18.4 Two images of a sample rank-based social network with variable population density. Each blue circle represents a city with a population whose size is proportional to the circle’s radius. Distances between cities, and hence between people, are computed using Manhattan distance. A rank-based friendship for each person u is formed probabilistically, where Pr[u → v] is inversely proportional to the number of people who live closer to u than v is, breaking ties consistently. The local neighbors – for each person u, one friend in the neighboring city in each cardinal direction – are not shown
A few notes about this theorem are in order. First, relative to Theorem 18.1, rank-
based friendship has lost a logarithmic factor in the length of the path found by
greedy routing. Recently, in joint work with David Barbella, George Kachergis,
Anna Sallstrom, and Ben Sowell, we were able to show that a “cautious” variant on
greedy routing finds a path of expected length O(log² n) in rank-based networks [6],
but the analogous tightening for greedy routing itself remains open.
Second, Theorem 18.2 makes a claim about the expected length of the path found
by the greedy algorithm for a randomly chosen target t, where the expectation is
taken over both the random construction of the network and the random choice of
the target. In contrast, Theorem 18.1 makes a claim about the expected length of the
path found by the greedy algorithm for any target, where the expectation is taken
only over the random construction of the network. Intuitively, some targets in a rank-
based network may be very difficult to reach: if a person t lives in a region of the
network that has a comparatively very sparse population, then there will be very few
long-range links to people near t. Thus making progress toward an isolated target
may be very difficult. However, the difficulty of reaching an isolated target like t is
offset by the low probability of choosing such a target; almost by definition, there
cannot be very many people who live in regions of the network that have unusually
low density. The proof of Theorem 18.2 formalizes this intuition [28, 32].
This technical difference in the statements of Theorems 18.1 and 18.2 in fact
echoes points raised by Judith Kleinfeld in her critique of the overly expansive
interpretation of Milgram’s experimental results [27]. Milgram’s stockbroker was
a socially prominent target, and other Milgram-style studies performed with less
prominent targets – e.g., the wife of a Harvard Divinity School student, in one study
performed by Milgram himself – yielded results much less suggestive of a small
world.
It is also worth noting that, although the “isolated target” intuition suggests
why existing proof techniques are unlikely to yield a “for all targets” version of
Theorem 18.2, there are no known population distributions in which greedy routing
fails to find a short path to any particular target in a rank-based network. It is an
interesting open question to resolve whether there are population distributions and
source–target pairs for which greedy routing fails to find a path of short expected
length in rank-based networks (where, as in Theorem 18.1, the expectation is taken
only over the construction of the network).
These two ways in which Theorem 18.2 is weaker than Theorem 18.1 are coun-
terbalanced by the fact that Theorem 18.2 can handle varying population densities,
but the real benefit is the potential for a better fit with real data. Figure 18.5
is the rank analogue of Figure 18.3: for any rank r, the fraction of LiveJournal users
who link to their rth-most geographically proximate person is displayed. (Some av-
eraging has been done in Figure 18.5: because a random person in the LiveJournal
network lives in a city with about 1,300 residents, the data do not permit us to
adequately distinguish among ranks that differ by less than this number.)
Fig. 18.5 The probability P(r) of a friendship between two people u and v in LiveJournal as a function of the rank r of v with respect to u [32]. Ranks are rounded into buckets of size 1,300, which is the average LiveJournal population of the city for a randomly chosen person in the network, and thus 1,300 is in a sense the “rank resolution” of the dataset. (The unaveraged data are noisier, but follow the same trend.) The solid line corresponds to P(r) ∝ 1/r. Note that Theorem 18.2 requires P(r) ∝ 1/r for a rank-based network to be navigable
As it was with distance, the link probability P(r) between two people is a smoothly decreasing function of the rank r of one with respect to the other. And just as before, link probability levels off to about ε = 5.0 × 10^−6 as rank gets large, so P(r) is well modeled by P(r) = Θ(1/r) + ε. But unlike the distance-based
model of Figure 18.3 and Theorem 18.1, the fit between Figure 18.5 and Theorem
18.2 is notable: people in the LiveJournal network really have formed links with a
geographic distribution that is a remarkably close match to rank-based friendship.
Until now, our discussion has concentrated on models of proximity that are based
on Manhattan distance in an underlying grid. We have argued that these grid-based
models are reasonable for geographic proximity. Even in the geographic context,
though, they are imperfect: the two-dimensional grid fails to account for real-world
geographic features like the third dimension of a high-rise apartment complex or the
imperfect mapping between geographic distance and transit-time distance between
two points. But in a real Milgram-style routing experiment, there are numerous other
measures of proximity that one might use as a guide in selecting the next step to-
ward a target: occupation, age, hobbies, and alma mater, for example. The grid is
a very poor model for almost all of these notions of distance. In this section, we
will consider models of social networks that better match these non-geographic no-
tions of similarity. Our discussion will include both non-grid-based models of social
networks and ways to combine multiple notions of proximity into a single routing
strategy.
school” like liberal arts college or research university, athletic conference, strength
of computer science department, etc. But even with these complications, similarity
of occupation, hobbies, or alma mater is more naturally modeled with a hierarchy
than with a grid.
Navigability in social networks derived from a hierarchical metric has been ex-
plored through analysis, simulation, and empirical study of real-world interactions.
Kleinberg has shown a similar result to Theorem 18.1 for the tree-based setting,
characterizing navigable networks in terms of a single parameter that controls how
rapidly the link probability between people drops off with their distance [24]. As in
the grid, Kleinberg’s theorem identifies an optimal middle ground in the trade-off
between having overly parochial and overly scattered connections: if T is a regular b-ary tree and Pr[u → v] ∝ b^(−β·lca(u, v)), where lca(u, v) denotes the height of the least common ancestor of u and v in T, then the network is navigable if and only if β = 1. Watts, Dodds, and Newman [48] have explored a similar hierarchical
setting, finding the ranges of parameters that were navigable in simulations. (Their
focus was largely on the combination of multiple hierarchical measures of proxim-
ity, an issue to which we will turn shortly.) Routing in the hierarchical context has
also been studied empirically by Adamic and Adar, who considered the role of prox-
imity in the organizational structure of Hewlett–Packard Labs in social links among
HP Labs employees [1]. (Because a company’s organizational structure forms a tree
where people more senior in the organization are mapped to internal nodes instead of
to leaves, Adamic and Adar consider a minor variation on LCA to measure person-
to-person proximity.) Adamic and Adar found that, as with geography, there is a
strong trace of organizational proximity in observed connections, and that, again as
with geography, greedily routing toward a target based on organizational proximity
was generally effective (see Section 18.7 for some discussion).
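As a small illustration of the tree-based model (my own sketch, not code from [24]; labeling leaves by their root-to-leaf digit strings is an assumption made for convenience), the long-range contact can be drawn with probability proportional to b^(−β·h), where h is the height of the least common ancestor of the two leaves:

import random
from itertools import product

def lca_height(u, v):
    # Height of the least common ancestor of two leaves of a complete b-ary tree,
    # where each leaf is named by its root-to-leaf digit string, e.g. (0, 2, 1, 1).
    shared = 0
    for x, y in zip(u, v):
        if x != y:
            break
        shared += 1
    return len(u) - shared

def sample_tree_contact(u, leaves, b, beta):
    # Sample a contact v for leaf u with Pr[u -> v] proportional to
    # b ** (-beta * lca_height(u, v)).
    others = [v for v in leaves if v != u]
    weights = [b ** (-beta * lca_height(u, v)) for v in others]
    return random.choices(others, weights=weights, k=1)[0]

# All leaves of a complete 3-ary tree of depth 4, and one sampled contact.
leaves = list(product(range(3), repeat=4))
print(sample_tree_contact(leaves[0], leaves, b=3, beta=1.0))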
The question of navigability of a social network derived from an underlying
measure of distance has also been explored beyond the contexts of the grid and
the tree. Many papers have considered routing in networks in which person-to-
person distances are measured by shortest-path distances in an underlying graph that
has some special combinatorial structure. These papers then typically state bounds
on navigability that are based on certain structural parameters of the underlying
graph; examples include networks that have low treewidth [16], bounded growth
rate [14, 15, 42], or low doubling dimension [19, 47]. The results on rank-based
friendship, including generalizations and improvements on Theorem 18.2, have also
been extended to the setting of low doubling dimension [6,28]. However, a complete
understanding of the generality of these navigability results in terms of properties
of the underlying metric remains open.
Another way to model person-to-person proximity – and also to model varia-
tion in population density, in a different way from rank-based friendship – is the
very general group-structure model of Kleinberg [24]. Each person in an n-person
population is a member of various groups (perhaps defined by a shared physical
neighborhood, an employer, a hobby), and Pr[u → v] is a decreasing function of the size of the smallest group containing both u and v. Kleinberg proved that the resulting network is navigable if Pr[u → v] is inversely proportional to the size of the smallest group including both u and v, subject to two conditions on the groups.
Informally, these conditions are the following. First, every group g must be “cov-
ered” by relatively large subgroups (so that once a path reaches g it can narrow in
on a smaller group containing any particular target t). Second, groups must satisfy a
sort of “bounded growth” condition (so that a person u has only a limited number of
people who are in a group of any particular size with u, and thus u has a reasonable
probability of “escaping” from small groups to reach a faraway target t).
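The link-formation rule of the group-structure model can be sketched as follows (my own illustration; the two navigability conditions above are not checked by this code, and the example groups are hypothetical):

import random

def sample_group_contact(u, groups):
    # Sample a contact v for u with Pr[u -> v] inversely proportional to the size
    # of the smallest group containing both u and v.
    #   groups -- list of sets of people
    weights = {}
    for v in set().union(*groups):
        if v == u:
            continue
        shared_sizes = [len(g) for g in groups if u in g and v in g]
        if shared_sizes:  # skip people who share no group with u
            weights[v] = 1.0 / min(shared_sizes)
    candidates = list(weights)
    return random.choices(candidates, weights=[weights[v] for v in candidates], k=1)[0]

# Hypothetical groups: a neighborhood, an employer, and a hobby.
groups = [{"ann", "bob", "cal"}, {"bob", "cal", "dee", "eve"}, {"ann", "eve"}]
print(sample_group_contact("ann", groups))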
These simulations show that, for these values of k, the resulting network appears to
be searchable under a broader range of parameters for the function giving friend-
ship probability as a function of distance. As with Theorem 18.1, there is provably a
single exponent β = 1 under which greedy routing produces polylogarithmic paths when there is one hierarchy; for two or three hierarchies, these simulations showed a wider range of values of β that yield navigable networks.
The results in the Watts–Dodds–Newman setting are based on simulations, and
giving a fully rigorous theoretical analysis of routing in this context remains an
interesting open challenge. So too do a variety of generalizations of that setting:
dependent hierarchies, a combination of grid-based and hierarchy-based measures
of proximity, or the incorporation of variable population density into the multiple-
hierarchy setting. Broader modeling questions remain open, too. One can conceive
of subtler ways of combining multiple dimensions of similarity that seem more
realistic than just the sum or the minimum. For example, it seems that making sig-
nificant progress toward a target in one dimension of similarity at the expense of
large decreases in similarity in several other dimensions is a routing mistake, even if
it reduces the minimum distance to the target over all the dimensions. Realistically
modeling these multidimensional scenarios is an interesting open direction.
The generalizations that we have discussed so far are based on extending greedy
routing to broader and more realistic notions of proximity, but one can also consider
enriching the routing algorithm itself. For example, algorithms that endow individ-
uals with additional “semi-local” information about the network, such as awareness
of one’s friends’ friends, have also been studied (e.g., [18, 30, 34, 35, 47]). But there
is another natural and simple consideration in Milgram-style routing that we have
not mentioned thus far: some people have more friends than others. This is a signif-
icant omission of the models that we have discussed; people in these models have
a constant or nearly constant number of friends. In contrast, degrees in real social
networks are usually well modeled by a power-law distribution, in which the proportion of the population with f friends is approximately 1/f^γ, where γ is a constant typically around 2.1–2.4 in real networks (see, e.g., [5, 8, 12, 29, 39]). In the routing
context, a popular person can present a significant advantage in finding a shorter
path to the target. A person with more friends has a higher probability of knowing
someone who is significantly closer to any target – by virtue of having drawn more
samples from the friendship distribution – and thus a more popular person will be
more likely to find a shorter path to a given target.
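For reference, degrees with this kind of distribution can be drawn as in the sketch below (my own; the exponent 2.3 is simply a value in the reported range, and the degree cutoffs are assumptions):

import random

def sample_power_law_degree(gamma=2.3, min_degree=1, max_degree=10000):
    # Draw one degree f with Pr[f] proportional to f ** (-gamma)
    # over the range [min_degree, max_degree].
    degrees = list(range(min_degree, max_degree + 1))
    weights = [f ** (-gamma) for f in degrees]
    return random.choices(degrees, weights=weights, k=1)[0]

# A small sample of degrees from a power law with exponent 2.3.
print([sample_power_law_degree() for _ in range(10)])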
Strategies that choose high-degree people in routing have been studied in a num-
ber of contexts, and, largely through simulation, these strategies have been shown to
perform reasonably well [1–3, 22, 46]. Of these, perhaps the most promising algo-
rithm for homophilous power-law networks is the expected-value navigation (EVN)
algorithm of Şimşek and Jensen [46], which explicitly combines popularity and
proximity in choosing the next step in a chain. Under EVN, the current node u
chooses as the next node in the path its neighbor v whose probability of a direct link
to the target is maximized. The node u computes this probability using the knowl-
edge of v’s proximity to t as well as v’s outdegree δ_v. An underlying model like the grid, for example, describes the probability p_v that a particular one of v’s friendships will connect v to t; one can then compute the probability 1 − (1 − p_v)^(δ_v) that one of the δ_v friendships of v will connect v to t. EVN chooses the friend maximizing this
probability as the next step in the chain. Although Şimşek and Jensen give empirical
evidence for EVN’s success, no theoretical analysis has been performed. Analyzing
this algorithm – or other similar algorithms that incorporate knowledge of node de-
gree in addition to target proximity – in a formal setting is an important and open
problem. Although a precise rigorous account of EVN has not yet been given, it
is clear that EVN captures something crucial about real routing: the optimal rout-
ing strategy is some combination of getting close to a target in terms of similarity
(the people who are more likely to know others most like the target) and of getting
to popular intermediate people who have a large social circle (the people who are
more likely to know many others in general). The interplay between popularity and
proximity – and incorporating richer notions of proximity into that understanding –
is a rich area for further research.
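A minimal sketch of the EVN choice rule as just described (my own rendering, not code from [46]; the single-link probability p_v is supplied by whatever underlying proximity model one assumes):

def evn_next_step(u, neighbors, out_degree, p_single_link):
    # Expected-value navigation: from u, forward to the neighbor v maximizing the
    # probability 1 - (1 - p_v) ** delta_v that at least one of v's delta_v
    # friendships connects v directly to the target.
    #   neighbors     -- list of u's contacts
    #   out_degree    -- dict: person -> number of friends (delta_v)
    #   p_single_link -- function: v -> probability p_v that a single friendship of v
    #                    reaches the target, under the assumed proximity model
    def hit_probability(v):
        p_v = p_single_link(v)
        return 1.0 - (1.0 - p_v) ** out_degree[v]
    return max(neighbors, key=hit_probability)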
18.7 Discussion
It is clear that the wayfinding problem for real people in real social networks is
only approximated by the models of social networks and of social-network rout-
ing discussed in this chapter. In many ways, real wayfinding is easier than it is in
these models: we know which of our friends lived in Japan for a year, or tend to
be politically conservative, or have a knack for knowing people in many walks of
life, and we also have some intuitive sense of how to weight these considerations
in navigating the network toward a particular target person. But real wayfinding is
harder for real people in many ways, too: for example, even seemingly simple ge-
ography-based routing is, at best, a challenge for the third of college-age Americans
who were unable to locate Louisiana on a map of the USA, even after the extensive
press coverage of Hurricane Katrina [43].
The models of similarity and network knowledge that we have considered here
are simplistic, and studying more realistic models – models with richer notions of
proximity, or models of the errors or inconsistencies in individuals’ mental maps of
these notions of proximity, for example – is very interesting. But there is, of course,
a danger of trying to model “too well”: the most useful models do not reproduce all
of the fine-grained details of a real-world phenomenon, but rather shed light on that
phenomenon through some simple and plausible explanation of its origin.
With this perspective in mind, I will highlight just one question here: why and
how do social networks become navigable? A number of models of the evolution
of social networks through the “rewiring” of long-range friendships in a grid-like
setting have been defined and analyzed [10, 11, 44]; these authors have shown that
navigability emerges in the network when this rewiring is done appropriately. We
have seen here that rank-based friendship is another way to explain the navigabil-
ity of social networks, and we have seen that friendships in LiveJournal, viewed
geographically, are well approximated by rank-based friendship. One piece is miss-
ing from the rank-based explanation, though: why is it that rank-based friendship
should hold in a real social network, even approximately? Figure 18.3 shows that
geography plays a remarkably large role in friendships even in LiveJournal’s purely
virtual community; friendship probability drops off smoothly and significantly as
geographic proximity decreases. Furthermore, Figure 18.5 shows that rank-based
friendship is a remarkably accurate model of friendship in this network. But are
there natural processes that can account for this behavior? Why should geographic
proximity in the flesh-and-blood world resonate so much in the virtual world of
LiveJournal? And why should this particular rank-based pattern hold?
One explanation for the important role of geography in LiveJournal is that a
significant number of LiveJournal friendships are online manifestations of exist-
ing physical-world friendships, which crucially rely on geographic proximity for
their formation. This “virtualization” is undoubtedly an important process by which
friendships appear in an online community like LiveJournal, and it certainly ex-
plains some of geography’s key role. But accounting for the continued slow decay
in link probability as geographic separation increases from a few hundred kilome-
ters to a thousand kilometers, beyond the range of most spontaneous physical-world
interactions, seems to require some additional explanation. Here is one specula-
tive possibility: many interests held by LiveJournal users have natural “geographic
centers” – for example, the city where a professional sports team plays, or the town
where a band was formed, or the region where a particular cuisine is popular. Shared
interests form the basis for many friendships. The geographic factor in LiveJournal
could perhaps be explained by showing that the “mass” of u and v’s shared interests
(appropriately defined) decays smoothly as the geographic distance between u and v
increases. Recent work of Backstrom et al. [4] gives some very intriguing evidence
related to this idea. These authors have shown results on the geographic distribution
of web users who issue various search queries. They characterize both the geo-
graphic “centers” of particular search queries and the “spread” of those queries, in
terms of how quickly searchers’ interest in that query drops off with the geographic
distance from the query’s center. Developing a comprehensive model of friendship
formation on the basis of this underlying geographic nature of interests is a very
interesting direction for future work.
To close, I will mention one interesting perspective on the question of an under-
lying mechanism by which rank-based friendship might arise in LiveJournal. This
perspective comes from two other studies of node-linking behavior as a function
of node-to-node similarity, in two quite different contexts. Figure 18.6(b) shows
the results of the study by Adamic and Adar [1] of the linking probability between
HP Labs employees as a function of the distance between them in the corporate
hierarchy. Their measure of similarity is a variant of LCA, modified to allow the cal-
culation of distances to an internal node representing a manager in the organization.
Fig. 18.6 Three plots of distance versus linking probability: (a) the role of geographic distance
between LiveJournal users [32], a reproduction of Figure 18.3; (b) the role of corporate-hierarchy
distance between HP Labs employees, from a study by Lada Adamic and Eytan Adar [1]; and (c)
the role of lexical distance between pages on the web, from a study by Filippo Menczer [37]
Figure 18.6(c) shows a corresponding plot for the web, from the study by Filippo Menczer [37], in which the proximity of two pages is measured based on the lexical distance of the pages’ content. Because the raw link probabili-
ties are so small, the plot shows the probability that the neighborhoods of two pages
have nonempty overlap, where a page p’s neighborhood consists of the page p itself,
the pages to which p has a hyperlink, and pages that have a hyperlink to p.
Intriguingly, the LiveJournal linkage pattern, reproduced as Figure 18.6(a), and
the HP Labs plot in Figure 18.6(b) show approximately the same characteristic
shape in their logarithmic plots: a linear decay in link probability for comparatively
similar people, leveling off to an approximately constant link probability for com-
paratively distant pairs. Figure 18.6(c) shows the opposite pattern: the probability
of connection between two comparatively similar web pages is roughly constant,
and then begins to decay linearly (in the log–log plot) once the pages’ similarity
drops beyond a certain level. Figure 18.6(a) and (b) both plot link probability be-
tween people in a social network against their (geographic or corporate) distance;
Figure 18.6(c) plots link probability for web pages. Understanding why linking pat-
terns in social networks look different from the web – and, more generally, making
sense of what might be generating these distributions – remains a fascinating open
question.
Acknowledgements Thanks to Lada Adamic and Filippo Menczer for helpful discussions and
for providing the data used to generate Figure 18.6(b) and (c). I would also like to thank the
anonymous referees for their very helpful comments. This work was supported in part by NSF
grant CCF-0728779 and by grants from Carleton College.
References
1. Lada A. Adamic and Eytan Adar. How to search a social network. Social Networks, 27(3):
187–203, July 2005.
2. Lada A. Adamic, Rajan M. Lukose, and Bernardo A. Huberman. Local search in unstructured
networks. In Handbook of Graphs and Networks. Wiley-VCH, 2002.
3. Lada A. Adamic, Rajan M. Lukose, Amit R. Puniyani, and Bernardo A. Huberman. Search in
power-law networks. Physical Review E, 64(046135), 2001.
4. Lars Backstrom, Jon Kleinberg, Ravi Kumar, and Jasmine Novak. Spatial variation in
search engine queries. In Proceedings of the 17th International World Wide Web Conference
(WWW’08), pages 357–366, April 2008.
5. Albert-László Barabási and Eric Bonabeau. Scale-free networks. Scientific American, 288:
50–59, May 2003.
6. David Barbella, George Kachergis, David Liben-Nowell, Anna Sallstrom, and Ben Sowell.
Depth of field and cautious-greedy routing in social networks. In Proceedings of the 18th Inter-
national Symposium on Algorithms and Computation (ISAAC’07), pages 574–586, December
2007.
7. Lali Barrière, Pierre Fraigniaud, Evangelos Kranakis, and Danny Krizanc. Efficient routing in
networks with long range contacts. In Proceedings of the 15th International Symposium on
Distributed Computing (DISC’01), pages 270–284, October 2001.
8. Béla Bollobás, Oliver Riordan, Joel Spencer, and Gábor Tusnády. The degree sequence of a
scale-free random graph process. Random Structures and Algorithms, 18(3):279–290, May
2001.
9. Dorwin Cartwright and Alvin Zander. Group Dynamics: Research and Theory. Row, Peterson,
1953.
10. Augustin Chaintreau, Pierre Fraigniaud, and Emmanuelle Lebhar. Networks become navigable
as nodes move and forget. In Proceedings of the 35th International Colloquium on Automata,
Languages and Programming (ICALP’08), pages 133–144, July 2008.
11. Aaron Clauset and Cristopher Moore. How do networks become navigable? Manuscript, 2003.
Available as cond-mat/0309415.
12. Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. Power-law distributions in
empirical data. Manuscript, 2007. Available as arXiv:0706.1062.
13. Peter Sheridan Dodds, Roby Muhamad, and Duncan J. Watts. An experimental study of search
in global social networks. Science, 301:827–829, 8 August 2003.
14. Philippe Duchon, Nicolas Hanusse, Emmanuelle Lebhar, and Nicolas Schabanel. Could any
graph be turned into a small world? Theoretical Computer Science, 355(1):96–103, 2006.
15. Philippe Duchon, Nicolas Hanusse, Emmanuelle Lebhar, and Nicolas Schabanel. Towards
small world emergence. In Proceedings of the 18th ACM Symposium on Parallelism in Algo-
rithms and Architectures (SPAA’06), pages 225–232, August 2006.
16. Pierre Fraigniaud. Greedy routing in tree-decomposed graphs. In Proceedings of the 13th
Annual European Symposium on Algorithms (ESA’05), pages 791–802, October 2005.
17. Pierre Fraigniaud. Small worlds as navigable augmented networks: Model, analysis, and val-
idation. In Proceedings of the 15th Annual European Symposium on Algorithms (ESA’07),
pages 2–11, October 2007.
18. Pierre Fraigniaud, Cyril Gavoille, and Christophe Paul. Eclecticism shrinks even small worlds.
In Proceedings of the 23rd Symposium on Principles of Distributed Computing (PODC’04),
pages 169–178, July 2004.
19. Pierre Fraigniaud, Emmanuelle Lebhar, and Zvi Lotker. A doubling dimension threshold
Θ(log log n) for augmented graph navigability. In Proceedings of the 14th Annual European
Symposium on Algorithms (ESA’06), pages 376–386, September 2006.
20. Frank Harary and Robert Z. Norman. Graph Theory as a Mathematical Model in Social Sci-
ence. University of Michigan, 1953.
21. P. Killworth and H. Bernard. Reverse small world experiment. Social Networks, 1:159–192,
1978.
22. B. J. Kim, C. N. Yoon, S. K. Han, and H. Jeong. Path finding strategies in scale-free networks.
Physical Review E, 65(027103), 2002.
23. Jon Kleinberg. The small-world phenomenon: An algorithmic perspective. In Proceedings of
the 32nd Annual Symposium on Theory of Computing (STOC’00), pages 163–170, May
2000.
24. Jon Kleinberg. Small-world phenomena and the dynamics of information. In Advances in
Neural Information Processing Systems (NIPS’01), pages 431–438, December 2001.
25. Jon Kleinberg. Complex networks and decentralized search algorithms. In International
Congress of Mathematicians (ICM’06), August 2006.
26. Jon M. Kleinberg. Navigation in a small world. Nature, 406:845, 24 August 2000.
27. Judith Kleinfeld. Could it be a big world after all? The “six degrees of separation” myth.
Society, 39(61), April 2002.
28. Ravi Kumar, David Liben-Nowell, and Andrew Tomkins. Navigating low-dimensional and
hierarchical population networks. In Proceedings of the 14th Annual European Symposium on
Algorithms (ESA’06), pages 480–491, September 2006.
29. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, and
Eli Upfal. Stochastic models for the web graph. In Proceedings of the 41st IEEE Symposium
on Foundations of Computer Science (FOCS’00), pages 57–65, November 2000.
30. Emmanuelle Lebhar and Nicolas Schabanel. Close to optimal decentralized routing in long-
range contact networks. In Proceedings of the 31st International Colloquium on Automata,
Languages and Programming (ICALP’04), pages 894–905, July 2004.
31. Kurt Lewin. Principles of Topological Psychology. McGraw Hill, 1936.
32. David Liben-Nowell, Jasmine Novak, Ravi Kumar, Prabhakar Raghavan, and Andrew
Tomkins. Geographic routing in social networks. Proceedings of the National Academy of
Sciences, 102(33):11623–11628, August 2005.
33. Kevin Lynch. The Image of the City. MIT Press, 1960.
34. Gurmeet Singh Manku, Moni Naor, and Udi Wieder. Know thy neighbor’s neighbor: the power
of lookahead in randomized P2P networks. In Proceedings of the 36th ACM Symposium on
Theory of Computing (STOC’04), pages 54–63, June 2004.
35. Chip Martel and Van Nguyen. Analyzing Kleinberg’s (and other) small-world models. In
Proceedings of the 23rd Symposium on Principles of Distributed Computing (PODC’04), pages
179–188, July 2004.
36. Miller McPherson, Lynn Smith-Lovin, and James M. Cook. Birds of a feather: Homophily in
social networks. Annual Review of Sociology, 27:415–444, August 2001.
37. Filippo Menczer. Growing and navigating the small world web by local content. Proceedings
of the National Academy of Sciences, 99(22):14014–14019, October 2002.
38. Stanley Milgram. The small world problem. Psychology Today, 1:61–67, May 1967.
39. Michael Mitzenmacher. A brief history of lognormal and power law distributions. Internet
Mathematics, 1(2):226–251, 2004.
40. Jacob L. Moreno. Who Shall Survive? Foundations of Sociometry, Group Psychotherapy and
Sociodrama. Nervous and Mental Disease Publishing Company, 1934.
41. Van Nguyen and Chip Martel. Analyzing and characterizing small-world graphs. In Proceed-
ings of the 16th ACM–SIAM Symposium on Discrete Algorithms (SODA’05), pages 311–320,
January 2005.
42. Van Nguyen and Chip Martel. Augmented graph models for small-world analysis with
geographical factors. In Proceedings of the 4th Workshop on Analytic Algorithms and Combi-
natorics (ANALCO’08), January 2008.
43. Roper Public Affairs and National Geographic Society. 2006 geographic literacy study, May
2006. http://www.nationalgeographic.com/roper2006.
44. Oskar Sandberg and Ian Clarke. The evolution of navigable small-world networks. Manuscript,
2006. Available as cs/0607025.
45. Georg Simmel. Conflict And The Web Of Group Affiliations. Free Press, 1908. Translated by
Kurt H. Wolff and Reinhard Bendix (1955).
46. Özgür Şimşek and David Jensen. Decentralized search in networks using homophily and
degree disparity. In Proceedings of the 19th International Joint Conference on Artificial Intel-
ligence (IJCAI’05), pages 304–310, August 2005.
47. Aleksandrs Slivkins. Distance estimation and object location via rings of neighbors. In Pro-
ceedings of the 24th Symposium on Principles of Distributed Computing (PODC’05), pages
41–50, July 2005.
48. Duncan J. Watts, Peter Sheridan Dodds, and M. E. J. Newman. Identity and search in social
networks. Science, 296:1302–1305, 17 May 2002.
49. Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks.
Nature, 393:440–442, 1998.