An Interconnect Architecture for Networking Systems on Chips

To meet the demands of ever-increasing Internet traffic, the next generation of Internet backbone routers must deliver ultrahigh performance over an optical infrastructure. At the current Internet traffic growth rate, network service providers will likely deploy OC-768 routers in the foreseeable future. At the same time, as Internet and application service providers attempt to provide more diverse and differentiated services, routers will have to take on new tasks. In addition to routing and packet forwarding, routers will likely perform packet classification, distinguishing packets and grouping them according to their requirements; buffer management, determining buffer allocation and admission control for packets; and packet scheduling, determining how to sequence packets to meet service level agreements (SLAs). [1]
Traditionally, routers have used general-purpose reduced-instruction-set computer (RISC) processors or application-specific ICs (ASICs). Although general-purpose, processor-based router architectures provide the flexibility to upgrade to new router tasks, they will not satisfy the growing speed requirements for new, complex, packet-processing tasks. On the other hand, ASIC-based router implementations can provide the speed but not the required programming flexibility. These shortcomings of traditional RISC and ASIC designs mean that designers must develop new high-speed network processors that permit flexible programmability and work at OC-768 speed.
At OC-768 (40 Gbps), the IP packet arrival rate could reach approximately $114 \times 10^6$ packets per second (assuming 44 bytes per packet). To ensure that the worst-case time to process a packet does not exceed the packet interarrival time and thus violate SLAs, packet-processing time should be at most 9 ns per packet. Within this budget, a network processor must perform approximately 500 instructions on each arriving packet to enable packet forwarding and classification on packet flows. Hence, an OC-768 network processor must process 57 billion instructions per second, a performance level a multiprocessor system-on-a-chip (SOC) architecture can provide.
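The figures in this paragraph follow from simple arithmetic. The sketch below reproduces them under the stated assumptions (a 40-Gbps line rate, 44-byte packets, and 500 instructions per packet); the constant and variable names are illustrative and do not come from the article.

```python
# Back-of-the-envelope check of the OC-768 processing budget.
LINK_RATE_BPS = 40e9        # OC-768 line rate, in bits per second
PACKET_BYTES = 44           # assumed worst-case (smallest) packet size
INSTR_PER_PACKET = 500      # approximate instructions needed per packet

# Packets arriving per second when the link carries back-to-back 44-byte packets.
packets_per_second = LINK_RATE_BPS / (PACKET_BYTES * 8)

# Time available to process one packet before the next one arrives.
time_budget_ns = 1e9 / packets_per_second

# Aggregate instruction rate the network processor must sustain.
instr_per_second = packets_per_second * INSTR_PER_PACKET

print(f"{packets_per_second / 1e6:.1f} M packets/s")
print(f"{time_budget_ns:.1f} ns per packet")
print(f"{instr_per_second / 1e9:.1f} G instructions/s")
```

The results round to the article's 114 million packets per second, 9 ns per packet, and 57 billion instructions per second.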
Octagon is a novel on-chip communication architecture that can meet the performance requirements of network processor SOCs. Octagon's cost, performance, and scalability advantages make it suitable for the aggressive on-chip communication demands of not only networking SOCs, but also SOCs in several other domains.
Faraydon Karim
Anh Nguyen
STMicroelectronics
Sujit Dey
University of California,
San Diego
NETWORK PROCESSOR SYSTEMS ON CHIPS MEET THE SPEED AND FLEXIBILITY
REQUIREMENTS OF NEXT-GENERATION INTERNET ROUTERS. THE OCTAGON ON-
CHIP COMMUNICATION ARCHITECTURE, WITH ITS COST, PERFORMANCE, AND
SCALABILITY ADVANTAGES, SUPPORTS THESE NETWORK PROCESSOR SOCS.
0272-1732/02/$17.00 © 2002 IEEE. IEEE Micro, September-October 2002.
AN INTERCONNECT ARCHITECTURE
FOR NETWORKING
SYSTEMS ON CHIPS
High-performance communications
Consider the on-chip communication requirements that typical network processor applications impose. Using T.V. Lakshman and D. Stiliadis' packet classification algorithm with an estimated 10,000 classification rules and 16-bit on-chip memory width, [2] we must perform 625 memory accesses per packet arrival, or $71.3 \times 10^9$ memory accesses per second (in the worst case). Clearly, this necessitates the use of multiple memory components and an on-chip communication architecture that enables highly concurrent, high-speed communication between the multiple processor and memory components.
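The memory-access figures can be checked the same way. This sketch assumes that the 625 accesses per packet equal the rule count divided by the memory width in bits, which matches the numbers quoted but is an interpretation rather than something the article states; the names are illustrative.

```python
# Worst-case memory traffic for packet classification at OC-768.
RULES = 10_000                 # estimated number of classification rules
MEMORY_WIDTH_BITS = 16         # on-chip memory width
PACKETS_PER_SECOND = 114e6     # approximate OC-768 packet arrival rate

# Assumption: one bit per rule must be read, 16 bits per access.
accesses_per_packet = RULES // MEMORY_WIDTH_BITS

# Aggregate access rate the memory system must sustain.
accesses_per_second = accesses_per_packet * PACKETS_PER_SECOND

print(accesses_per_packet)                          # 625
print(f"{accesses_per_second / 1e9:.2f} G accesses/s")
```

No single memory component sustains tens of billions of accesses per second, which is why the text calls for multiple memories and a highly concurrent interconnect.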
Recent studies have demonstrated the significant role an on-chip communication architecture plays in determining a SOC's overall performance. [3] Several techniques let us design and synthesize on-chip communication to satisfy components' interface and communication needs in an application-specific system. [4-6] However, because one of a network processor's primary goals is to efficiently execute multiple applications (including evolving networking applications), synthesizing an application-specific interconnect architecture for a network processor SOC will not work.
Rather than synthesize a custom on-chip interconnect architecture for a given application, K. Lahiri and colleagues propose to optimally map a system's communication requirements to a given communication architecture. [7] In other work, they describe a technique that allows reconfiguration of the selected communication architecture's protocols according to the application's changing communication demands. [8] Proposed communication mapping and reconfiguration techniques provide up to an order of magnitude improvement in system performance. Taken together, mapping and reconfiguration techniques show promise for efficiently mapping multiple applications to the same interconnect fabric. [7,8] For these techniques to be successful, however, developers must select the appropriate on-chip communication architecture.
Despite recent advances in the analysis and design of high-performance on-chip communication architectures, commercial SOCs commonly use simple bus-based topologies and protocols. [9,10] Even the Virtual Socket Interface Alliance's efforts have focused on bus interface standards.
Although bus-based on-chip communication might be suitable for many applications, it clearly cannot satisfy an OC-768 network processor's very demanding on-chip communication needs. For high-performance computing systems, interconnect architectures based on crossbars, or crossbars mixed with buses, deliver the ultrahigh-performance communication needed among components. [11] Many switching architectures in high-performance routers also use crossbars. [1] An on-chip crossbar can satisfy the on-chip communication needs of an OC-768 network processor SOC. Theoretically, crossbar performance (in terms of throughput or delay) is high enough to permit development of efficient network processing tasks. [12] In reality, crossbar implementation costs are high: Crossbars require many on-chip wires and relays to minimize clock skew across the chip. In addition, crossbars do not scale well as the number of nodes to be connected increases. Although crossbar-based interconnects might be justified in high-performance computing systems and routers, they might not be the best economic choice for lower-cost and higher-volume network processor SOCs.
Octagon
The Octagon on-chip architecture is simpler to implement than a crossbar yet has much higher throughput than either shared buses or traditional crossbars. Unlike crossbars, Octagon's implementation complexity increases linearly with the number of nodes (processor or memory components) that the network must connect.
Architecture
As Figure 1 shows, a basic Octagon unit consists of eight nodes and 12 bidirectional links.
The Octagon architecture has several desirable properties:
• two-hop communication between any pair of nodes;
• higher aggregate throughput than a shared bus or crossbar interconnect under certain implementation conditions;
• simple, shortest-path routing algorithm; and
• less wiring than a crossbar interconnect.

Figure 1. Basic Octagon configuration includes eight nodes and 12 bidirectional links.
Octagon operates in packet- or circuit-switched mode. An Octagon packet is data that must be transferred from the source Octagon node to the destination Octagon node as a result of a communication request by the source node. An Octagon packet can be fixed or variable length. In packet-switched mode, the network buffers packets at intermediate nodes if there is contention at the egress link.
In circuit-switched mode, a network arbiter allocates the entire path between source and destination nodes of a communicating node pair for a number of clock cycles. Nonoverlapping communication paths can occur concurrently; that is, the arbiter permits spatial reuse. In this mode, system performance is a function of the chosen connection schedule. The question is, then, given the set of pending communication requests, how should the arbiter schedule connections to optimize throughput (or some other metric)? We have developed a simple connection scheduler, called the best-fit algorithm (described later), to enable Octagon's circuit-switched mode.
Packet routing
We can code Octagon node addresses into a three-bit field and route an Octagon packet as follows. We prepend a three-bit tag to each packet. Each node compares the tag (Packet_addr) to its own address (Node_addr) to determine the next action. The node computes the relative address of a packet as

Rel_addr = (Packet_addr - Node_addr) mod 8

At each node on the Octagon, packet routing is a function of Rel_addr:
• Rel_addr = 0, process at node
• Rel_addr = 1 or 2, route clockwise
• Rel_addr = 6 or 7, route counterclockwise
• Route across otherwise
Consequently, a predetermined, simple routing scheme for each network packet permits at most two hops to separate any two nodes.
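The routing rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that "clockwise" means stepping to node address + 1, "counterclockwise" to address - 1, and "across" to address + 4 (all modulo 8), and the function names `next_hop` and `route` are hypothetical.

```python
def next_hop(node_addr, packet_addr):
    """Return the neighbor to forward to, or None if the packet has arrived.

    Assumes clockwise = +1, counterclockwise = -1, across = +4 (mod 8).
    """
    rel_addr = (packet_addr - node_addr) % 8
    if rel_addr == 0:
        return None                      # process at this node
    if rel_addr in (1, 2):
        return (node_addr + 1) % 8       # route clockwise
    if rel_addr in (6, 7):
        return (node_addr - 1) % 8       # route counterclockwise
    return (node_addr + 4) % 8           # rel_addr is 3, 4, or 5: route across


def route(src, dst):
    """Return the list of nodes visited from src to dst, inclusive."""
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1], dst))
    return path


# Longest route over all source-destination pairs, in hops.
max_hops = max(len(route(s, d)) - 1 for s in range(8) for d in range(8))
print(route(0, 3), max_hops)
```

Enumerating all 64 source-destination pairs confirms the two-hop bound; for instance, a packet from node 0 to node 3 first crosses to node 4 and then steps counterclockwise.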
Implementation cost
Figure 2 illustrates the physical layout of the Octagon and crossbar interconnect architectures. In our network processor, each Octagon node consists of a processor-memory pair with an estimated size of 2 mm × 2 mm. Let us assume that the minimum wire spacing is 0.2 μm, and the width of a 32-bit link is 12 μm (including individual wire width, spacing, and shielding). As Figure 2a shows, the Octagon architecture consists of 12 horizontal and 12 vertical 32-bit tracks. Each horizontal track is upper-bounded by 8 mm, the total width of the four nodes. Each vertical track is upper-bounded by 0.156 mm (13 × 12 μm). As Figure 2b shows, the crossbar needs eight horizontal and 32 vertical 32-bit tracks. Although the horizontal tracks are upper-bounded by 8 mm as in the Octagon, the vertical tracks are upper-bounded by 0.108 mm (9 × 12 μm). Thus, wiring in Octagon is less complex than in a crossbar.

Figure 2. Octagon (a) and crossbar (b) physical-layout schematic examples. Octagon consists of 12 horizontal and 12 vertical 32-bit tracks with each horizontal track upper-bounded by 8 mm, and each vertical track upper-bounded by 0.156 mm (13 × 12 μm). The crossbar has eight horizontal and 32 vertical 32-bit tracks, with the horizontal tracks upper-bounded by 8 mm as in the Octagon, and the vertical tracks upper-bounded by 0.108 mm (9 × 12 μm).
Octagon versus crossbars and buses
Consider a typical SOC communication. Node processes continuously generate requests for service; examples include memory read and write requests. We classify requests according to their source-destination pair; thus, we denote requests that originate from node i with destination node j as type ij. Requests of type ij arrive at the system following a Poisson process with parameter $\lambda_{ij}$, which is the requests' arrival rate. Service time is the time a destination node requires to complete all requested tasks if it processes the request in isolation. We assume the communication links between source and destination nodes are locked until service is completed.
Service and response times
The required service time is equal to packet size divided by link rate, where packet size is the number of bytes of data transfer that result from the communication request, and link rate is the communication link's data transfer rate. We assume that the service time for request ij is exponentially distributed with parameter $\mu_{ij}$, with $1/\mu_{ij}$ as the average service time per request. This is a reasonable assumption because packet length varies and, in the most general case, could range from one to thousands of bytes. For example, if a node issues a read request for an 8-byte data block, then for a 1-MHz 8-bit wide data bus, the request service time is 8 microseconds.
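The 8-byte example works out as follows. The helper below is a small sketch of the service-time relation (packet size divided by link rate); the function name and parameters are illustrative, and it assumes the bus moves one bus-width of data per clock cycle.

```python
def service_time_seconds(packet_bytes, bus_width_bits, clock_hz):
    """Service time = packet size / link rate.

    Assumes the bus transfers bus_width_bits per clock cycle.
    """
    link_rate_bytes_per_s = (bus_width_bits / 8) * clock_hz
    return packet_bytes / link_rate_bytes_per_s


# 8-byte read over a 1-MHz, 8-bit bus: 8 transfers of 1 microsecond each.
t = service_time_seconds(packet_bytes=8, bus_width_bits=8, clock_hz=1e6)
print(f"{t * 1e6:.1f} microseconds")
```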
We can easily extend these assumptions to accommodate other discrete-time distributions such as Bernoulli arrivals, and geometric or deterministic service time. We consider a symmetric system where $\lambda_{ij} = \lambda$, $\forall i, j$ and $\mu_{ij} = \mu$, $\forall i \neq j$. The utilization of requests of type ij is $\rho_{ij} = \lambda_{ij}/\mu_{ij} = \lambda/\mu = \rho$. Utilization is the average amount of service demand arriving within one time unit. The aggregated arrival rate is $\lambda_{tot} = \sum_{ij} \lambda_{ij}$. Total utilization is $\rho_{tot} = \lambda_{tot}/\mu$.
We model the shared bus as a single-server queue with Poisson arrivals and exponential service time. Consider the aggregated request arrival process for the eight nodes. Because each individual arrival process is a Poisson process, their superposition is also a Poisson process. [12] In memory access applications, service rate corresponds to memory access speed. We ignore all propagation delays. The server serves queued requests in first-in, first-out (FIFO) order. An arriving request's response time is the difference between its arrival time and the time the bus completes its service. Because the server is work conserving, [13] the expected response time, denoted by $EW_{bus}$, for an arbitrary request arriving at the single-server queue is identical to that for a single-server multiple-queue system. The expected response time of a shared bus modeled by a single-server queue is [13]

$$EW_{bus} = \frac{\rho_{tot}}{\lambda_{tot}\,(1 - \rho_{tot})}.$$

For crossbar throughput, we use the model presented by J. Chen and T. Stern. [12] They showed that for a large switch (approximately 20 nodes) having a speedup factor of one, response time is

[Equation: $EW_{xbar}$, expressed in terms of $EW_s$, $p$, $p_o$, and $\lambda_{tot}$; see Chen and Stern. [12]]

Chen and Stern also investigated the impact of various arbitration policies on switch performance. They found that arbitration policies do not affect maximum throughput because it is only a function of the average service. Different arbitration policies result in different response times, however.
Communication request scheduling
We investigate the Octagon architecture performance through simulation, using a simple request-response traffic model. That is, a source node generates a request to send to a destination node. It eventually establishes a connection for the communication. For each connection, the source node sends a request and receives a response. After the communications, the participating nodes sever the connection.
We associate a processor and memory module with each node, as Figure 3 shows. Applications of this traffic model exist in routing table lookup, Internet protocol packet classification, and other networking functions where each node generates memory access requests. If the requested memory location is attached to the local node, then the node generates no Octagon communication requests. Otherwise, it must forward the request to the appropriate node via Octagon using the routing algorithm previously presented. At the destination node, the memory request consumes several clock cycles before spawning a response, which the destination node returns to the originating node.
The best-fit scheduler is a connection-oriented communication protocol that can simultaneously accommodate nonoverlapping connections. Each node maintains three queues of outstanding requests, one for each egress link. With respect to the overall network, this global scheduler gives priority to the head-of-line requests in arrival time order (lower arrival time implies higher priority). At every service completion time, the scheduler checks to see if it can make new connections based on the previously described priority scheme. The scheduler sets up connections until it can accommodate no more without violating the nonoverlapping rule. Note that we only consider head-of-line requests at each node. When a connection is torn down, the scheduler reactivates to check if it can set up new connections.
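One scheduling pass of this kind can be sketched as a greedy loop over head-of-line requests in arrival order. The code below is an illustration under stated assumptions, not the authors' scheduler: it treats two connections as overlapping when they share any undirected link, it uses the relative-address routing rule with clockwise = +1 and across = +4 (mod 8), and the names `next_hop`, `path_links`, and `best_fit` are hypothetical.

```python
def next_hop(node_addr, packet_addr):
    """Octagon routing rule; assumes clockwise = +1, across = +4 (mod 8)."""
    rel_addr = (packet_addr - node_addr) % 8
    if rel_addr == 0:
        return None
    if rel_addr in (1, 2):
        return (node_addr + 1) % 8
    if rel_addr in (6, 7):
        return (node_addr - 1) % 8
    return (node_addr + 4) % 8


def path_links(src, dst):
    """Set of undirected links used by the route from src to dst."""
    links, node = set(), src
    while node != dst:
        nxt = next_hop(node, dst)
        links.add(frozenset((node, nxt)))
        node = nxt
    return links


def best_fit(head_of_line, busy_links=()):
    """Grant head-of-line requests in arrival order while paths stay disjoint.

    head_of_line: iterable of (arrival_time, src, dst) tuples.
    busy_links:   links already held by connections still in service.
    Returns the list of requests granted in this scheduling pass.
    """
    busy = set(busy_links)
    granted = []
    for request in sorted(head_of_line):          # lower arrival time first
        links = path_links(request[1], request[2])
        if links.isdisjoint(busy):                # nonoverlapping rule
            busy |= links
            granted.append(request)
    return granted


requests = [(0, 0, 2), (1, 1, 3), (2, 4, 5), (3, 0, 4)]
print(best_fit(requests))
```

In this example the request from node 1 to node 3 is deferred because it needs link 1-2, which the earlier request from node 0 to node 2 already holds; the other three connections proceed concurrently.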
Figure 4 is a detailed view of the node model. In addition to the request generator, processor, and memory, each node has three ingress and three egress ports labeled left, across, or right, consistent with its associated neighbor. Logically, the node emulates a simple 4 × 4 nonblocking switch (plus processor and memory). The switch has neither input nor output buffering. The central scheduler performs switch arbitration.
Performance results
Figure 5 compares the expected response time of Octagon, a bus, and a crossbar. We fix $\mu = 0.5$ per clock cycle and vary $\lambda_{tot}$. The horizontal axis represents $\rho_{tot} = \lambda_{tot}/\mu$, and the vertical axis represents expected response time in clock cycles. Recall that $\lambda_{tot}$ is the system's total packet arrival rate; $1/\mu$, its average packet size; and $\rho_{tot}$, the average number of packets the system can service concurrently. Octagon has significantly higher maximum throughput than both the bus and crossbar. We obtain similar results for fixed-size packets.

Figure 3. High-level application model. Each node is associated with a processor and a memory module.

Figure 4. Node model. Each node has a request generator, processor, memory, and three ingress and three egress ports. Switch arbitration is through the central scheduler.
As these results show, the bus saturates at $\rho_{tot} = 1$ because a single server (the bus bandwidth) can provide at most one service unit per time unit. For the crossbar, we assume a single queue per crossbar node; hence we can model the crossbar as a system of eight queues sharing eight servers. Contention and head-of-line blocking mean that the eight available servers provide approximately four work units per time unit. [12] Therefore, crossbar saturation occurs at $\rho_{tot} \approx 4$. We could have considered eight queues per crossbar node, but the implementation cost would be prohibitive.
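To see why the bus curve rises so sharply near $\rho_{tot} = 1$, the sketch below evaluates the standard M/M/1 mean response time, $1/(\mu - \lambda_{tot})$, for the shared-bus model with $\mu = 0.5$ per clock cycle. It illustrates the single-server model only; it does not reproduce the simulated Octagon or crossbar curves, and the function name is hypothetical.

```python
MU = 0.5  # service rate per clock cycle, as fixed in the experiments


def bus_response_time(rho_tot, mu=MU):
    """Mean M/M/1 response time (clock cycles) at total utilization rho_tot."""
    if not 0 <= rho_tot < 1:
        raise ValueError("a single-server queue is stable only for 0 <= rho_tot < 1")
    lambda_tot = rho_tot * mu          # total arrival rate
    return 1.0 / (mu - lambda_tot)     # equals 1 / (mu * (1 - rho_tot))


for rho in (0.1, 0.5, 0.9, 0.99):
    print(f"rho_tot = {rho:4.2f}  ->  {bus_response_time(rho):7.1f} cycles")
```

The response time starts at $1/\mu = 2$ cycles for an idle bus and grows without bound as utilization approaches 1, which is the saturation point described above.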
On the other hand, it might be reasonable for each Octagon node to have three queues. Hence we model the Octagon architecture as a system of 24 queues and 24 servers (three egress queues and three outgoing links per node). This means we incur more cost per node for Octagon than for the crossbar. However, saturation for Octagon occurs at $\rho_{tot} \approx 12$, which is significantly higher than for a crossbar, as Figure 5 shows. Note that the effective server utilization is about 50 percent (12/24), the same as for the crossbar.
Some packet service approaches achieve high throughput by compromising service latency. That is, system efficiency and throughput increase as the workload at each queue builds. Figure 6 shows that at relatively high utilization of $\rho_{tot} = 12$ and at $10^{-4}$ packet loss probability, a system using an Octagon architecture requires fewer than 50 packet buffers. Thus, the average queue occupancy is not excessive.
For Octagon's best-fit connection scheduling, a node (process) is not blocked if the scheduler cannot schedule its communication request immediately. Instead, the requesting node queues the request in its egress queue. This strategy can improve system performance and node utilization more than some communication protocols (especially most bus protocols), which stall the requesting node until its request can be granted. However, each Octagon node must have a queue large enough to avoid packet loss. Figure 6 shows the packet loss probability as the size of each node's egress queue increases. Note that a queue size of 25 results in a nominal packet loss of 0.1 percent, while a queue size of 35 reduces the packet loss probability to 0.01 percent. If needed, a system designer can enable a zero packet loss guarantee in Octagon by having the packet scheduler refuse requests if the egress queue is at full or near-full capacity, thereby stalling the requesting node (as many existing buses do).

Figure 5. Throughput comparison of Octagon (best-fit scheduler), a bus, and a crossbar for randomized packet size. Horizontal axis: utilization, $\rho_{tot}$; vertical axis: latency (clock cycles).

Figure 6. Packet loss probability versus egress queue size for a system using the Octagon architecture.
Scalability
The increasing performance demands of programmable network processors make scalability an important factor in Octagon's design. Next-generation network processors will likely have 16 or more processors and many distributed memory components holding tables for Internet protocol lookup and classification. The need for interconnecting greater numbers of on-chip components in network processors and other SOCs will accelerate in the foreseeable future, increasing the need for scalable, on-chip communication architectures.
Strategy 1: Low wiring complexity
One of the Octagon architecture's strengths is its ability to scale linearly. Figure 7a shows a scaling strategy that requires two different node types: bridge and member. As the name implies, bridge nodes connect adjacent Octagons and perform hierarchical packet routing (for example, node y in Figure 7a). Member nodes attach to only one Octagon (node x, for example). Consider a network consisting of eight interconnected Octagons. The Octagon address field of each packet is 6 bits wide: three high-order bits to identify the local Octagon and three low-order bits to identify the node within the Octagon. Each bridge node performs static routing based on the entire field, and each member node performs routing based only on the three low-order bits.
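The 6-bit address layout described above can be decoded with two bit operations. This is a minimal sketch; the function name `split_address` is illustrative and not from the article.

```python
def split_address(addr6):
    """Split a 6-bit Octagon address into (octagon_id, node_id)."""
    if not 0 <= addr6 < 64:
        raise ValueError("address must fit in 6 bits")
    octagon_id = (addr6 >> 3) & 0b111   # three high-order bits: which Octagon
    node_id = addr6 & 0b111             # three low-order bits: node within it
    return octagon_id, node_id


# Address 0b101011 names node 3 of Octagon 5.
print(split_address(0b101011))
```

A member node needs only `node_id` to apply the relative-address routing rule, while a bridge node inspects both fields to decide whether the packet must leave the local Octagon.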
Figure 7 also shows the estimated wiring cost as a function of increasing network size for the Octagon and crossbar architectures. To arrive at these estimates, we assumed the SOC layout in Figure 2a for each Octagon. We extended the layout to the configuration of Figure 2b for crossbars with more than eight nodes. Octagon scales linearly and the crossbar does not because in the crossbar, every node is wired to every other node. As the number of nodes N grows, the number of required wires is of order $O(N^2)$. In this Octagon scaling strategy, on the other hand, each node requires either three or six wires to its neighbors, resulting in wiring complexity of $O(cN)$ for a constant c.

Figure 7. Scaling strategy 1. Bridge nodes (node y) connect adjacent Octagons (a) and perform hierarchical packet routing. Member nodes attach to only one Octagon (node x). The tables give the maximum distance for Octagon (b) and crossbar (c) networks of various sizes.

(b) Octagon
Nodes        Horizontal links,       Vertical links,         Maximum distance
             no./max length (mm)     no./max length (mm)     (hops)
8            12/8                    12/0.156                2
15           24/8                    24/0.156                4
22 (shown)   36/8                    36/0.156                6 (shown)

(c) Crossbar
Nodes        Horizontal links,       Vertical links,         Maximum distance
             no./max length (mm)     no./max length (mm)     (hops)
8            8/8                     32/0.108                1
15           15/16                   120/0.192               1
22           22/11                   242/0.276               1
The tables in Figure 7 show the maximum distance (the maximum number of hops between any two nodes) for networks of various sizes. The maximum distance increases linearly as Octagon grows, while the crossbar has a constant maximum distance irrespective of the number of nodes. Although a higher maximum distance can degrade performance, the performance results in Figure 5 indicate that this might not always be the case: An 8-node Octagon with maximum distance 2 performs better than an 8-node crossbar with maximum distance 1.
In this strategy, the maximum distance between nodes grows with network size; it does not remain constant as it does for the crossbar. This is fine for SOCs where low wire complexity is the dominant consideration. However, this characteristic might not suit systems where high throughput is the primary concern. For example, consider a SOC with 15 nodes. A bridge node connects Octagons 1 and 2. If all network traffic crosses between the two Octagons, then a bottleneck occurs at the bridge because it must transmit all traffic from Octagon 1 over the three links a, b, and c. Therefore, it can concurrently transfer at most three packets in each direction, one measure of a communication architecture's maximum throughput.
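The Octagon rows of the Figure 7 table follow a simple pattern, sketched below. The closed forms are inferred from the three tabulated sizes (8, 15, and 22 nodes) and assume that each added Octagon shares exactly one bridge node with its neighbor; extending the pattern beyond three Octagons is an extrapolation, and the function name is hypothetical.

```python
def strategy1_size(num_octagons):
    """Return (nodes, links, max_hops) for a chain of bridged Octagons.

    Assumption: adjacent Octagons share one bridge node, so each Octagon
    after the first adds 7 new nodes, 12 links, and 2 hops of distance.
    """
    if num_octagons < 1:
        raise ValueError("need at least one Octagon")
    nodes = 8 + 7 * (num_octagons - 1)
    links = 12 * num_octagons
    max_hops = 2 * num_octagons
    return nodes, links, max_hops


for k in (1, 2, 3):
    print(k, strategy1_size(k))
```

Links grow in proportion to the node count, which is the linear wiring cost the text describes, but so does the maximum distance, which is the trade-off that motivates a different strategy when throughput matters most.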
Strategy 2: High performance
For systems in which high performance is the dominant consideration, we propose a second scaling strategy that performs better than the first but has more complex wiring. In this strategy, we extend Octagon to multidimensional space. Figure 8a illustrates this scaling strategy in a 64-node Octagon. We index each SOC node by the 2-tuple $(i, j)$, $i, j \in [0, 7]$. For each $i = I$, $I \in [0, 7]$, we construct an Octagon using nodes $\{(I, j),\ j \in [0, 7]\}$, which results in eight individual Octagon structures. We then connect these Octagons to each other by linking corresponding i nodes according to the Octagon configuration. That is, each node $(I, J)$ belongs to two Octagons: one consisting of nodes $\{(I, j) \mid j \in [0, 7]\}$, and the other consisting of nodes $\{(i, J) \mid i \in [0, 7]\}$. The table in Figure 8 indicates the increase in wiring complexity (number of links) and maximum distance (number of hops) as the number of connected nodes increases. The maximum distance between nodes scales much better under strategy 2 than strategy 1. However, strategy 2's better performance scalability comes at the cost of greater wiring complexity.
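The link counts and maximum distances for this multidimensional construction can be checked by building the graph and running a breadth-first search. The sketch below assumes that within each Octagon a node with index k links to indices k + 1, k - 1, and k + 4 (mod 8); the function names are illustrative and not from the article.

```python
from collections import deque
from itertools import product

OCTAGON_OFFSETS = (1, 7, 4)   # clockwise, counterclockwise, across (mod 8)


def build_network(dimensions):
    """Adjacency sets for an Octagon extended to the given number of dimensions."""
    nodes = list(product(range(8), repeat=dimensions))
    adjacency = {node: set() for node in nodes}
    for node in nodes:
        for axis in range(dimensions):
            for offset in OCTAGON_OFFSETS:
                neighbor = list(node)
                neighbor[axis] = (node[axis] + offset) % 8
                adjacency[node].add(tuple(neighbor))
    return adjacency


def link_count(adjacency):
    """Number of bidirectional links (each is stored at both endpoints)."""
    return sum(len(neighbors) for neighbors in adjacency.values()) // 2


def max_distance(adjacency):
    """Largest shortest-path hop count over all node pairs."""
    worst = 0
    for source in adjacency:
        hops = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in hops:
                    hops[neighbor] = hops[node] + 1
                    queue.append(neighbor)
        worst = max(worst, max(hops.values()))
    return worst


for d in (1, 2):
    net = build_network(d)
    print(len(net), link_count(net), max_distance(net))
```

Under these assumptions one dimension gives 8 nodes, 12 links, and 2 hops, and two dimensions give 64 nodes, 192 links, and 4 hops; each added dimension multiplies the node count by eight while adding only two hops.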
Figure 9 shows a natural scaling approach for networks with fewer nodes, and therefore fewer Octagons to connect (each node represents an Octagon). We scale the network as follows. To connect two Octagons, we construct a link between corresponding nodes; this bidirectional link connects node i from Octagon 1 to node i from Octagon 2. As the number of Octagons to be connected increases, we link corresponding nodes according to the Octagon rule. That is, to connect eight Octagons, 12 bidirectional links are needed to connect each node i of Octagons 0, 1, 2, ..., 7. For a network with more nodes, we start adding links in a new dimension. As Figure 9 shows, by increasing the network size we can maintain the maximum hop count at three while increasing the number of nodes to 32. These nodes represent a network of four Octagons with two hops to the corresponding intra-Octagon node and one hop to the destination Octagon.

Figure 8. Scaling strategy 2. We extend the Octagon to the multidimensional space by linking corresponding nodes in adjacent Octagons according to the Octagon configuration (a). The table (b) indicates the increased wiring complexity (number of links), and the greater maximum distance needed as the number of connected nodes increases.

(b)
Nodes         Links     Maximum distance (hops)
8             12        2
64 (shown)    192       4 (shown)
512           2,304     6
Figure 10 shows Octagon's advantage over the crossbar architecture as the number of nodes increases. Although the crossbar's implementation cost (measured in number of links) increases prohibitively with increasing nodes, the Octagon scaling strategies make scaling feasible.

Our analysis shows that Octagon significantly outperforms shared bus and crossbar on-chip communication architectures in terms of performance, implementation cost, and scalability. We are currently investigating the use of Octagon to satisfy the on-chip communication needs of other application-specific multiprocessor SOCs.

Acknowledgments
We acknowledge Naresh Soni for his encouragement and support, and Razak Hossain for his help on physical layout and implementation issues.
Figure 9. Growing the network using scaling strategy 2.

Figure 10. Comparison of Octagon scaling strategies (strategy 1 and strategy 2) to a crossbar architecture: number of nodes versus wiring complexity (number of links).

References
1. V.P. Kumar, T.V. Lakshman, and D. Stiliadis, "Beyond Best Effort: Router Architectures for the Differentiated Services of Tomorrow's Internet," IEEE Comm., vol. 36, no. 5, May 1998, pp. 152-164.
2. T.V. Lakshman and D. Stiliadis, "High-Speed Policy-Based Packet Forwarding Using Efficient Multidimensional Range Matching," Proc. ACM SIGCOMM, ACM Press, New York, 1998, pp. 191-202.
3. K. Lahiri, A. Raghunathan, and S. Dey, "Evaluation of the Traffic Performance Characteristics of System-on-Chip Communication Architectures," Proc. 14th Int'l Conf. VLSI Design, IEEE CS Press, Los Alamitos, Calif., 2001, pp. 29-35.
4. J.A. Rowson and A. Sangiovanni-Vincentelli, "Interface-Based Design," Proc. 34th Ann. Design Automation Conf., ACM Press, New York, 1997, pp. 178-183.
5. R.B. Ortega and G. Borriello, "Communication Synthesis for Distributed Embedded Systems," Proc. Int'l Conf. Computer-Aided Design (ICCAD 98), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 437-444.
6. M. Gasteier and M. Glesner, "Bus-Based Communication Synthesis on System Level," ACM Trans. Design Automation Electronic Systems, vol. 4, no. 1, Jan. 1999, pp. 1-11.
7. K. Lahiri, A. Raghunathan, and S. Dey, "Communication Architecture Tuners: A Methodology for the Design of High-Performance Communication Architectures for System-on-Chips," Proc. 37th Design Automation Conf., ACM Press, New York, 2000, pp. 513-518.
8. K. Lahiri, A. Raghunathan, and S. Dey, "Efficient Exploration of the SOC Communication Architecture Design Space," Proc. Int'l Conf. Computer-Aided Design (ICCAD 00), IEEE CS Press, Los Alamitos, Calif., 2000, pp. 424-430.
9. "Sonics Integration Architecture," www.sonicsinc.com.
10. D. Flynn, "AMBA: Enabling Reusable On-Chip Designs," IEEE Micro, vol. 17, no. 4, July-Aug. 1997, pp. 20-27.
11. A. Charlesworth, "The Sun Fireplane Interconnect," IEEE Micro, vol. 22, no. 1, Jan.-Feb. 2002, pp. 36-45.
12. J. Chen and T. Stern, "Throughput Analysis, Optimal Buffer Allocation, and Traffic Imbalance Study of a Generic Nonblocking Packet Switch," IEEE J. Selected Areas in Comm., vol. 9, no. 3, Apr. 1991, pp. 439-449.
13. D. Gross and C. Harris, Fundamentals of Queueing Theory, 3rd ed., John Wiley & Sons, New York, 1998, pp. 297-300.
Faraydon Karim is an ST Fellow at STMicroelectronics Advanced System Technology, Advanced Computing Lab in La Jolla, California. His research interests include computer and embedded system architecture. Karim has a PhD in computer engineering from La Salle University. He is a member of the IEEE.

Anh Nguyen is a research engineer at the STMicroelectronics Advanced Systems Technology, Advanced Computing Lab in La Jolla, California. His research interests include performance analysis, resource allocation, and algorithm development. Nguyen has a PhD in electrical engineering from the University of California, Los Angeles. He is a member of the IEEE.

Sujit Dey is a professor in the Electrical and Computer Engineering Department at the University of California, San Diego. His research interests include configurable platforms consisting of adaptive wireless protocols and algorithms, and deep-submicron adaptive SOCs for next-generation wireless appliances and network infrastructure devices. Dey has a PhD in computer science from Duke University. He is a member of the IEEE.

Direct questions and comments to Faraydon Karim at STMicroelectronics, Advanced System Technology, San Diego, CA 92121; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://computer.org/publications/dlib.