An Interconnect Architecture for Networking Systems on Chips
ρ_ij = λ_ij/μ. Utilization is the average amount of service demand arriving within one time unit. The aggregated arrival rate is λ_tot = Σ_ij λ_ij. Total utilization is ρ_total = λ_tot/μ.
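To make the bookkeeping concrete, the aggregate quantities can be computed directly from a per-pair rate matrix. This is a minimal sketch with hypothetical λ_ij values; the 0.004 figure is illustrative, not from the article:

```python
# Hypothetical per-pair arrival rates lambda_ij for the eight nodes;
# the 0.004 value is illustrative only.
lam = [[0.0 if i == j else 0.004 for j in range(8)] for i in range(8)]
mu = 0.5                                  # service rate per time unit

lam_tot = sum(sum(row) for row in lam)    # aggregated arrival rate
rho_total = lam_tot / mu                  # total utilization
print(lam_tot, rho_total)
```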
We model the shared bus as a single server
queue with Poisson arrivals and exponential
service time. Consider the aggregated request
arrival process for the eight nodes. Because the individual arrival processes are Poisson, their superposition is also Poisson.¹²
In memory-access applications, the service rate corresponds to memory access speed. We ignore all propagation delays. The server serves queued requests in first-in, first-out (FIFO) order. An arriving request's response time is the difference between its arrival time and the time the bus completes its service. Because the server is work conserving,¹³ the expected response time, denoted by E[W_bus], for an arbitrary request arriving at the single-server queue is identical to that for a single-server, multiple-queue system. The expected response time of a shared bus modeled by a single-server queue is¹³

E[W_bus] = 1/(μ − λ_tot).
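The single-server model is easy to sanity-check numerically. The sketch below simulates a work-conserving FIFO queue with Poisson arrivals and exponential service and compares the measured mean response time against the M/M/1 value 1/(μ − λ_tot); the parameter values are illustrative, not taken from the article.

```python
import random

def mm1_sim(lam, mu, n=100_000, seed=1):
    """Simulate an M/M/1 FIFO queue and return the mean response time
    (queueing delay plus service), measured over n requests."""
    rng = random.Random(seed)
    t_arrive = 0.0   # arrival clock
    t_free = 0.0     # time at which the server next becomes idle
    total = 0.0
    for _ in range(n):
        t_arrive += rng.expovariate(lam)      # Poisson arrival stream
        start = max(t_arrive, t_free)         # FIFO, work-conserving
        t_free = start + rng.expovariate(mu)  # exponential service
        total += t_free - t_arrive            # response time of this request
    return total / n

lam_tot, mu = 0.25, 0.5                  # illustrative load: rho = 0.5
analytic = 1.0 / (mu - lam_tot)          # M/M/1 expected response time
print(analytic, mm1_sim(lam_tot, mu))    # simulation should be close to 4.0
```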
For crossbar throughput, we use the model
presented by J. Chen and T. Stern.¹² They
showed that for a large switch (approximately 20 nodes) having a speedup factor of one, the response time is E[W_xbar] = E[W_s] + E[W_o], the sum of two component delays expressed in terms of λ_tot, μ, and the per-port load on the eight-node switch.
Chen and Stern also investigated the impact
of various arbitration policies on switch per-
formance. They found that arbitration poli-
cies do not affect maximum throughput
because it is only a function of the average service rate. Different arbitration policies result in
different response times, however.
Communication request scheduling
We investigate the Octagon architecture's performance through simulation, using a simple request-response traffic model. That is, a source
node generates a request to send to a destina-
tion node. It eventually establishes a connection
for the communication. For each connection,
the source node sends a request and receives a
response. After the communication, the participating nodes sever the connection.
We associate a processor and memory mod-
ule with each node, as Figure 3 shows. Appli-
cations of this traffic model exist in routing
table lookup, Internet protocol packet classi-
fication, and other networking functions
where each node generates memory access
requests. If the requested memory location is
attached to the local node, then it generates no
Octagon communication requests. Otherwise,
it must forward the request to the appropriate
node via Octagon using the routing algorithm
previously presented. At the destination node,
the memory request consumes several clock
cycles before spawning a response, which the destination returns to the originating node.
The best-fit scheduler is a connection-oriented communication protocol that can simultaneously accommodate nonoverlapping connections. Each node maintains three queues
of outstanding requests, one for each egress
link. With respect to the overall network, this
global scheduler gives priority to the head-of-
line requests in arrival time order (lower arrival
time implies higher priority). At every service
completion time, the scheduler checks to see if
it can make new connections based on the pre-
viously described priority scheme. The sched-
uler sets up connections until it can
accommodate no more without violating the
nonoverlapping rule. Note that we only con-
sider head-of-line requests at each node. When
a connection is torn down, the scheduler reac-
tivates to check if it can set up new connections.
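The best-fit pass can be sketched as a greedy loop over head-of-line requests in arrival order, granting a request only when its path shares no link with a connection already in place. The path-as-link-set representation and the example link names are assumptions for illustration, not the authors' implementation:

```python
def best_fit_schedule(head_of_line, busy_links):
    """One scheduler pass: visit head-of-line requests in arrival-time
    order (lower arrival time = higher priority) and set up every
    connection whose links are disjoint from those already in use.

    head_of_line: list of (arrival_time, node, path_links) tuples.
    busy_links:   set of links held by active connections (mutated).
    Returns the requests granted in this pass."""
    granted = []
    for arrival, node, path in sorted(head_of_line):
        links = set(path)
        if links.isdisjoint(busy_links):  # nonoverlapping rule
            busy_links |= links           # connection set up
            granted.append((arrival, node, path))
    return granted

# Example: the second request overlaps the first on link "0-1", so only
# the first and third requests are granted in this pass.
reqs = [(1.0, 0, ["0-1"]), (2.0, 7, ["0-1", "1-2"]), (3.0, 4, ["4-5"])]
print(best_fit_schedule(reqs, set()))
```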
Figure 4 is a detailed view of the node
model. In addition to the request generator,
processor, and memory, each node has three
ingress and three egress ports labeled left,
across, or right, consistent with its associated
neighbor. Logically, the node emulates a simple 4 × 4 nonblocking switch (plus processor
and memory). The switch has neither input
nor output buffering. The central scheduler
performs switch arbitration.
Performance results
Figure 5 compares the expected response time of Octagon, a bus, and a crossbar. We fix μ = 0.5 per clock cycle and vary λ_tot. The horizontal axis represents ρ_tot = λ_tot/μ, and the vertical axis represents expected response time in clock cycles. Recall that λ_tot is the system's total packet arrival rate; 1/μ, its average packet size;
and μ_tot, the average number of packets the system can service concurrently. Octagon has significantly higher maximum throughput than both the bus and the crossbar. We obtain similar results for fixed-size packets.

Figure 3. High-level application model. Each node is associated with a processor and a memory module.

Figure 4. Node model. Each node has a request generator, processor, memory, and three ingress and three egress ports. Switch arbitration is through the central scheduler.
As these results show, the bus saturates at ρ_tot = 1 because a single server (the bus bandwidth) can provide at most one service unit per time unit. For the crossbar, we assume a single queue per crossbar node; hence we can model the crossbar as a system of eight queues sharing eight servers. Contention and head-of-line blocking mean that the eight available servers provide approximately four work units per time unit.¹² Therefore, crossbar saturation occurs at ρ_tot ≈ 4. We could have considered eight queues per crossbar node, but the implementation cost would be prohibitive.
On the other hand, it might be reasonable for each Octagon node to have three queues. Hence we model the Octagon architecture as a system of 24 queues and 24 servers (three egress queues and three outgoing links per node). This means we incur more cost per node for Octagon than for the crossbar. However, saturation for Octagon occurs at ρ_tot ≈ 12, which is significantly higher than for a crossbar, as Figure 5 shows. Note that the effective server utilization is about 50 percent (12/24), the same as for the crossbar.
Some packet service approaches achieve high throughput by compromising service latency. That is, system efficiency and throughput increase as the workload at each queue builds. Figure 6 shows that at relatively high utilization of ρ_tot = 12 and at 10⁻⁴ packet loss probability, a system using an Octagon architecture
requires fewer than 50 packet buffers. Thus,
the average queue occupancy is not excessive.
For Octagon's best-fit connection scheduling, a node (process) is not blocked if the scheduler cannot schedule its communication request immediately. Instead, the requesting node queues the request in its egress queue. This strategy can improve system performance and node utilization more than communication protocols (especially most bus protocols) that stall the requesting node until the request can be granted. However, each Octagon node must have a queue large enough to avoid packet loss. Figure 6 shows the packet
loss probability as the size of each node's egress queue increases. Note that a queue size of 25 results in a nominal packet loss of 0.1 percent, while a queue size of 35 reduces the packet loss probability to 0.01 percent. If needed, a system designer can enable a zero-packet-loss guarantee in Octagon by having the packet scheduler refuse requests if the egress queue is at or near full capacity, thereby stalling the requesting node (as many existing buses do).

Figure 5. Throughput comparison of Octagon, a bus, and a crossbar for randomized packet sizes (latency in clock cycles versus utilization ρ_tot).

Figure 6. Packet loss probability versus egress queue size for a system using the Octagon architecture.
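The qualitative buffer-size/loss trade-off can be illustrated with the classic M/M/1/K blocking formula; this analytic stand-in is an assumption for illustration and will not reproduce the simulated curve in Figure 6 exactly:

```python
def mm1k_loss(rho, k):
    """Blocking (loss) probability of an M/M/1/K queue with utilization
    rho and room for k packets: (1 - rho) * rho**k / (1 - rho**(k + 1))."""
    if rho == 1.0:
        return 1.0 / (k + 1)
    return (1.0 - rho) * rho**k / (1.0 - rho**(k + 1))

# Loss falls steeply as the egress queue grows, mirroring the trend in
# Figure 6 (values here come from the formula, not the article's simulation).
for k in (25, 30, 35):
    print(k, mm1k_loss(0.5, k))
```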
Scalability
The increasing performance demands of programmable network processors make scalability an important factor in Octagon's design. Next-generation network processors
will likely have 16 or more processors and
many distributed memory components hold-
ing tables for Internet protocol lookup and
classification. The need for interconnecting
greater numbers of on-chip components in
network processors and other SOCs will accel-
erate in the foreseeable future, increasing the
need for scalable, on-chip communication
architectures.
Strategy 1: Low wiring complexity
One of the Octagon architecture's strengths is its ability to scale linearly. Figure 7a shows
a scaling strategy that requires two different
node types: bridge and member. As the name
implies, bridge nodes connect adjacent
Octagons and perform hierarchical packet
routing (for example, node y in Figure 7a).
Member nodes attach to only one Octagon
(node x, for example). Consider a network
consisting of eight interconnected Octagons. The Octagon address field of each packet is 6 bits wide: three high-order bits to identify the local Octagon and three low-order bits to identify the node within the Octagon. Each bridge node performs static routing based on the entire field, and each member node performs routing based only on the three low-order bits.
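The 6-bit address split, and the intra-Octagon hop a member node derives from the low-order bits, can be sketched as follows. The hop rule shown (relative address: +1, −1, or +4 mod 8) is the standard Octagon relative-address rule, stated here as an assumption; the function names are ours:

```python
def split_address(addr):
    """Split a 6-bit packet address: three high-order bits name the
    local Octagon, three low-order bits name the node within it."""
    return (addr >> 3) & 0b111, addr & 0b111

def member_next_hop(cur, dst):
    """One intra-Octagon hop on the low-order bits, assuming the
    relative-address rule: go right (+1), left (-1), or across (+4)
    modulo 8, which reaches any node in at most two hops."""
    rel = (dst - cur) % 8
    if rel == 0:
        return cur                # packet is at its destination node
    if rel in (1, 2):
        return (cur + 1) % 8      # clockwise neighbor
    if rel in (6, 7):
        return (cur - 1) % 8      # counterclockwise neighbor
    return (cur + 4) % 8          # across link for rel in (3, 4, 5)

print(split_address(0b101011))    # (5, 3): Octagon 5, node 3
```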
Figure 7 also shows the estimated wiring cost
as a function of increasing network size for the
Octagon and crossbar architectures. To arrive
at these estimates, we assumed the SOC lay-
out in Figure 2a for each octagon. We extended the layout to the configuration of Figure 2b
for crossbars with more than eight nodes.
Octagon scales linearly and the crossbar does
not because in the crossbar, every node is wired
to every other node. As the number of nodes
N grows, the number of required wires is of
order O(N²). In this Octagon scaling strategy, on the other hand, each node requires either three or six wires to its neighbors, resulting in wiring complexity of O(cN).

Figure 7. Scaling strategy 1. Bridge nodes (node y) connect adjacent Octagons (a) and perform hierarchical packet routing. Member nodes attach to only one Octagon (node x). The tables give the wiring costs and maximum distances for Octagon (b) and crossbar (c) networks of various sizes.

Octagon (b):
Nodes        Horizontal links (no./max length, mm)   Vertical links (no./max length, mm)   Maximum distance (hops)
8            12/0.156                                12/8                                  2
15           24/0.156                                24/8                                  4
22 (shown)   36/0.156                                36/8                                  6

Crossbar (c):
Nodes        Horizontal links (no./max length, mm)   Vertical links (no./max length, mm)   Maximum distance (hops)
8            32/0.108                                8/8                                   1
15           120/0.192                               15/16                                 1
22           242/0.276                               22/11                                 1
The tables in Figure 7 show the maximum
distance (the maximum number of hops
between any two nodes) for networks of var-
ious sizes. The maximum distance increases
linearly as Octagon grows, while the crossbar
has a constant maximum distance irrespective
of the number of nodes. Although a higher
maximum distance can degrade performance,
the performance results in Figure 5 indicate
that this might not always be the case: An 8-
node Octagon with maximum distance 2 per-
forms better than an 8-node crossbar with
maximum distance 1.
In this strategy, the maximum distance
between nodes grows much more slowly, but
it does not remain constant as for the crossbar.
This is fine for SOCs where low wire com-
plexity is the dominant consideration. How-
ever, this characteristic might not suit systems
where high throughput is the primary concern.
For example, consider a SOC with 15 nodes. A bridge node connects Octagons 1 and 2. If all network traffic is across networks, then a bottleneck occurs at the bridge because it must transmit all traffic from Octagon 1 over the three links a, b, and c. Therefore, it can concurrently transfer at most three packets in each direction, one measure of a communication architecture's maximum throughput.
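The growth rates behind Figure 7 can be compared with simple logical link counts. These count bidirectional links in the abstract topology (12 per eight-node Octagon under strategy 1, N(N−1)/2 for a full crossbar), not the layout wire segments tabulated in the figure:

```python
def octagon_links_strategy1(n_octagons):
    """Strategy 1: each eight-node Octagon contributes 12 bidirectional
    links (8 around the ring plus 4 across); links are not shared."""
    return 12 * n_octagons

def crossbar_links(n_nodes):
    """Full crossbar: every node wired to every other node, O(N^2)."""
    return n_nodes * (n_nodes - 1) // 2

# Linear versus quadratic growth (logical links, illustrative count).
for k in (1, 2, 3):
    nodes = 8 + 7 * (k - 1)   # bridge nodes are shared between Octagons
    print(nodes, octagon_links_strategy1(k), crossbar_links(nodes))
```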
Strategy 2: High performance
For systems in which high performance is
the dominant consideration, we propose a sec-
ond scaling strategy that performs better than
the first but has more complex wiring. In this strategy, we extend Octagon to multidimensional space. Figure 8a illustrates this scaling strategy in a 64-node Octagon. We index each SOC node by the 2-tuple (i, j), i, j ∈ [0, 7]. For each i = I, I ∈ [0, 7], we construct an Octagon using nodes {(I, j), j ∈ [0, 7]}, which results in eight individual Octagon structures. We then connect these Octagons to each other by linking corresponding nodes according to the Octagon configuration. That is, each node (I, J) belongs to two Octagons: one consisting of nodes {(I, j) | j ∈ [0, 7]}, and the other consisting of nodes {(i, J) | i ∈ [0, 7]}. The table in Figure 8 indicates the increase in wiring complexity (number of links) and maximum distance (number of hops) as the number of connected nodes increases. The maximum distance between nodes scales much better under strategy 2 than under strategy 1. However, strategy 2's better performance scalability comes at the cost of greater wiring complexity.
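The strategy-2 construction can be checked mechanically: each node (i, j) in the 64-node network has the three Octagon neighbors (+1, −1, +4 mod 8) in each of its two Octagons, and counting the resulting unordered pairs reproduces the 192 links in Figure 8's table. A sketch (the function name is ours):

```python
def neighbors_2d(i, j):
    """Neighbors of node (i, j) in the 64-node strategy-2 network:
    Octagon offsets +1, -1 (= +7), and +4 mod 8, applied in each of
    the node's two Octagons (fixed i, then fixed j)."""
    offsets = (1, 7, 4)
    row = [(i, (j + d) % 8) for d in offsets]   # Octagon {(i, *)}
    col = [((i + d) % 8, j) for d in offsets]   # Octagon {(*, j)}
    return row + col

# Six links per node; each link counted from both ends -> 64 * 6 / 2 = 192.
links = {frozenset([(i, j), n])
         for i in range(8) for j in range(8)
         for n in neighbors_2d(i, j)}
print(len(links))   # 192
```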
Figure 9 (next page) shows a natural scaling
approach for networks with fewer nodes, and
therefore fewer Octagons to connect (each
node represents an Octagon). We scale the net-
work as follows. To connect two Octagons, we
construct a link between corresponding nodes;
this bidirectional link connects node i from
Octagon 1 to node i from Octagon 2. As the number of Octagons to be connected increases, we link corresponding nodes according to the Octagon rule. That is, to connect eight Octagons, 12 bidirectional links are needed to connect each node i of Octagons 0, 1, 2, …, 7. For a network with more nodes, we start adding links in a new dimension. As Figure 9 shows, by increasing the network size we can maintain the maximum hop count at three while increasing the number of nodes to 32. These nodes represent a network of four Octagons with two hops to the corresponding intra-Octagon node and one hop to the destination Octagon.

Figure 8. Scaling strategy 2. We extend the Octagon to multidimensional space by linking corresponding nodes in adjacent Octagons according to the Octagon configuration (a). The table (b) gives the wiring complexity (number of links) and maximum distance as the number of connected nodes increases:

Nodes        Links   Maximum distance (hops)
8            12      2
64 (shown)   192     4
512          2,304   6
Figure 10 shows Octagon's advantage over the crossbar architecture as the number of nodes increases. Although the crossbar's implementation cost (measured in number of links) increases prohibitively with increasing nodes, the Octagon scaling strategies make scaling feasible.
Our analysis shows that Octagon significantly outperforms shared bus and crossbar on-chip communication architectures in terms of performance, implementation cost, and scalability. We are currently investigating the use of Octagon to satisfy the on-chip communication needs of other application-specific multiprocessor SOCs.
Acknowledgments
We acknowledge Naresh Soni for his
encouragement and support, and Razak Hos-
sain for his help on physical layout and imple-
mentation issues.
References
1. V.P. Kumar, T.V. Lakshman, and D. Stiliadis,
Beyond Best Effort: Router Architectures
for the Differentiated Services of Tomorrow's Internet, IEEE Comm., vol. 36, no. 5,
May 1998, pp. 152-164.
2. T.V. Lakshman and D. Stiliadis, High-Speed
Policy-Based Packet Forwarding Using Efficient Multidimensional Range Matching,
Proc. ACM SIGCOMM, ACM Press, New
York, 1998, pp. 191-202.
3. K. Lahiri, A. Raghunathan, and S. Dey, Eval-
uation of the Traffic Performance Charac-
teristics of System-on-Chip Communication
Architectures, Proc. 14th Intl Conf. VLSI
Design, IEEE CS Press, Los Alamitos, Calif.,
2001, pp. 29-35.
4. J.A. Rowson and A. Sangiovanni-Vincentel-
li, Interface-Based Design, Proc. 34th Ann.
Design Automation Conf., ACM Press, New York, 1997, pp. 178-183.

Figure 9. Growing the network using scaling strategy 2.

Figure 10. Comparison of Octagon scaling strategies to a crossbar architecture: number of nodes versus wiring complexity.
5. R.B. Ortega and G. Borriello, Communica-
tion Synthesis for Distributed Embedded
Systems, Proc. Intl Conf. Computer-Aided
Design (ICCAD 98), IEEE CS Press, Los
Alamitos, Calif., 1998, pp. 437-444.
6. M. Gasteier and M. Glesner, Bus-Based
Communication Synthesis on System Level,
ACM Trans. Design Automation Electronic
Systems, vol. 4, no. 1, Jan. 1999, pp. 1-11.
7. K. Lahiri, A. Raghunathan, and S. Dey,
Communication Architecture Tuners: A
Methodology for the Design of High-Perfor-
mance Communication Architectures for
System-on-Chips, Proc. 37th Design
Automation Conf., ACM Press, New York,
2000, pp. 513-518.
8. K. Lahiri, A. Raghunathan, and S. Dey, Efficient Exploration of the SOC Communication
Architecture Design Space, Proc. Intl Conf.
Computer-Aided Design (ICCAD 00), IEEE CS
Press, Los Alamitos, Calif., 2000, pp. 424-430.
9. Sonics Integration Architecture, www.
sonicsinc.com.
10. D. Flynn, AMBA: Enabling Reusable On-
Chip Designs, IEEE Micro, vol. 7, no. 4,
July-Aug. 1997, pp. 20-27.
11. A. Charlesworth, The Sun Fireplane Inter-
connect, IEEE Micro, vol. 22, no. 1, Jan.-
Feb. 2002, pp. 36-45.
12. J. Chen and T. Stern, Throughput Analysis,
Optimal Buffer Allocation, and Traffic Imbal-
ance Study of a Generic Nonblocking Pack-
et Switch, IEEE J. Selected Areas in
Comm., vol. 9, no. 3, Apr. 1991, pp. 439-449.
13. D. Gross and C. Harris, Fundamentals of
Queueing Theory, 3rd ed., John Wiley &
Sons, New York, 1998, pp. 297-300.
Faraydon Karim is an ST Fellow at STMicro-
electronics Advanced System Technology,
Advanced Computing Lab in La Jolla, Cali-
fornia. His research interests include comput-
er and embedded system architecture. Karim
has a PhD in computer engineering from La
Salle University. He is a member of the IEEE.
Anh Nguyen is a research engineer at the
STMicroelectronics Advanced Systems Tech-
nology, Advanced Computing Lab in La Jolla,
California. His research interests include per-
formance analysis, resource allocation, and
algorithm development. Nguyen has a PhD
in electrical engineering from the University
of California, Los Angeles. He is a member of
the IEEE.
Sujit Dey is a professor in the Electrical and
Computer Engineering Department at the
University of California, San Diego. His
research interests include configurable plat-
forms consisting of adaptive wireless proto-
cols and algorithms, and deep-submicron
adaptive SOCs for next-generation wireless
appliances and network infrastructure devices.
Dey has a PhD in computer science from
Duke University. He is a member of the IEEE.
Direct questions and comments to Faray-
don Karim at STMicroelectronics, Advanced
System Technology, San Diego, CA 92121;
[email protected].
For further information on this or any other
computing topic, visit our Digital Library at
http://computer.org/publications/dlib.