AN INTERCONNECT ARCHITECTURE FOR NETWORKING SYSTEMS ON CHIPS
NETWORK PROCESSOR SYSTEMS ON CHIPS MEET THE SPEED AND FLEXIBILITY
REQUIREMENTS OF NEXT-GENERATION INTERNET ROUTERS. THE OCTAGON ON-
Authorized licensed use limited to: AMRITA VISHWA VIDYAPEETHAM AMRITA SCHOOL OF ENGINEERING. Downloaded on August 25,2023 at 17:59:41 UTC from IEEE Xplore. Restrictions apply
High-performance communications
Consider the on-chip communication requirements that typical network processor applications impose. Using T.V. Lakshman and D. Stiliadis' packet classification algorithm with an estimated 10,000 classification rules and 16-bit on-chip memory width [2], we must perform 625 memory accesses per packet arrival, or 71.3 × 10^9 memory accesses per second (in the worst case). Clearly, this necessitates the use of multiple memory components, and an on-chip communication architecture that enables highly concurrent, high-speed communication between the multiple processor and memory components.

Recent studies have demonstrated the significant role an on-chip communication architecture plays in determining a SOC's overall performance [3]. Several techniques let us design and synthesize on-chip communication to satisfy components' interface and communication needs in an application-specific system [4-6]. However, because one of a network processor's primary goals is to efficiently execute multiple applications (including evolving networking applications), synthesizing an application-specific interconnect architecture for a network processor SOC will not work.

Rather than synthesize a custom on-chip interconnect architecture for a given application, K. Lahiri and colleagues propose to optimally map a system's communication requirements to a given communication architecture [7]. In other work, they describe a technique that allows reconfiguration of the selected communication architecture's protocols according to the application's changing communication demands [8]. Proposed communication mapping and reconfiguration techniques provide up to an order of magnitude improvement in system performance. Taken together, mapping and reconfiguration techniques show promise for efficiently mapping multiple applications to the same interconnect fabric [7,8]. For these techniques to be successful, however, developers must select the appropriate on-chip communication architecture.

Despite recent advances in the analysis and design of high-performance on-chip communication architectures, commercial SOCs commonly use simple bus-based topologies and protocols [9,10]. Even the Virtual Socket Interface Alliance's efforts have focused on bus interface standards.

Although bus-based on-chip communication might be suitable for many applications, it clearly cannot satisfy an OC-768 network processor's very demanding on-chip communication needs. For high-performance computing systems, interconnect architectures based on crossbars, or crossbars mixed with buses, deliver the ultrahigh-performance communication needed among components [11]. Many switching architectures in high-performance routers also use crossbars [1]. An on-chip crossbar can satisfy the on-chip communication needs of an OC-768 network processor SOC. Theoretically, crossbar performance (in terms of throughput or delay) is high enough to permit development of efficient network processing tasks [12]. In reality, crossbar implementation costs are high: crossbars require many on-chip wires and relays to minimize clock skew across the chip. In addition, crossbars do not scale well as the number of nodes to be connected increases. Although crossbar-based interconnects might be justified in high-performance computing systems and routers, they might not be the best economic choice for lower-cost and higher-volume network processor SOCs.

Octagon
The Octagon on-chip architecture is simpler to implement than a crossbar yet has much higher throughput than either shared buses or traditional crossbars. Unlike crossbars, Octagon's implementation complexity increases linearly with the number of nodes (processor or memory components) that the network must connect.

Architecture
As Figure 1 shows, a basic Octagon unit consists of eight nodes and 12 bidirectional links.

Figure 1. Basic Octagon configuration includes eight nodes and 12 bidirectional links.

The Octagon architecture has several desirable properties:
SEPTEMBER–OCTOBER 2002 37
NETWORK SOC COMMUNICATIONS
Figure 2. Octagon (a) and crossbar (b) physical-layout schematic examples. Octagon consists of 12 horizontal and 12 vertical 32-bit tracks, with each horizontal track upper-bounded by 8 mm and each vertical track upper-bounded by 0.156 mm (13 × 12 microns). The crossbar has eight horizontal and 32 vertical 32-bit tracks, with the horizontal tracks upper-bounded by 8 mm as in the Octagon, and the vertical tracks upper-bounded by 0.108 mm (9 × 12 microns).
38 IEEE MICRO
Octagon architecture consists of 12 horizontal and 12 vertical 32-bit tracks. Each horizontal track is upper-bounded by 8 mm, the total width of the four nodes. Each vertical track is upper-bounded by 0.156 mm (13 × 12 microns). As Figure 2b shows, the crossbar needs eight horizontal and 32 vertical 32-bit tracks. Although the horizontal tracks are upper-bounded by 8 mm as in the Octagon, the vertical tracks are upper-bounded by 0.108 mm (9 × 12 microns). Thus, wiring in Octagon is less complex than in a crossbar.

Octagon versus crossbars and buses
Consider a typical SOC communication. Node processes continuously generate requests for service; examples include memory read and write requests. We classify requests according to their source-destination pair; thus, we denote requests that originate from node $i$ with destination node $j$ as type $ij$. Requests of type $ij$ arrive at the system following a Poisson process with parameter $\lambda_{ij}$, which is the requests' arrival rate. Service time is the time a destination node requires to complete all requested tasks if it processes the request in isolation. We assume the communication links between source and destination nodes are locked until service is completed.

Service and response times
The required service time is equivalent to packet size divided by link rate, where packet size is the number of bytes of data transfer that result from the communication request, and link rate is the communication link's data transfer rate. We assume that the service time for request $ij$ is exponentially distributed with [...] average amount of service demand arriving within one time unit. The aggregated arrival rate is $\lambda_{tot} = \sum_{ij} \lambda_{ij}$. Total utilization is $\rho_{tot} = \lambda_{tot}/\mu$.

We model the shared bus as a single-server queue with Poisson arrivals and exponential service time. Consider the aggregated request arrival process for the eight nodes. Because each individual arrival process is Poisson, their superposition is also a Poisson process [12]. In memory access applications, service rate corresponds to memory access speed. We ignore all propagation delays. The server serves queued requests in first-in, first-out (FIFO) order. An arriving request's response time is the difference between its arrival time and the time the bus completes the service. Because the server is work conserving [13], the expected response time, denoted by $EW_{bus}$, for an arbitrary request arriving at the single-server queue is identical to that for a single-server multiple-queue system. The expected response time of a shared bus modeled by a single-server queue is [13]

$$EW_{bus} = \frac{\rho_{tot}}{\lambda_{tot}\,(1 - \rho_{tot})}.$$

For crossbar throughput, we use the model presented by J. Chen and T. Stern [12]. They showed that for a large switch (approximately 20 nodes) having a speedup factor of one, response time is

$$EW_{xbar} = EW_s + \frac{\lambda_{tot}\, E^{2}W_s}{2\,(8 - \lambda_{tot}\, EW_s)},$$

$$\text{where}\quad EW_s = p_o\,\frac{8\lambda_{tot}}{(8\mu - \lambda_{tot})^{2}} + \frac{1}{\mu},$$
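The shared-bus model can be checked numerically. The following is a minimal sketch, assuming a single common service rate `mu` for all request types and a hypothetical uniform traffic matrix; the function names and the example rates are illustrative and do not come from the article. It evaluates the single-server expression $\rho_{tot} / (\lambda_{tot}(1 - \rho_{tot}))$, which is algebraically the same as the familiar M/M/1 result $1/(\mu - \lambda_{tot})$.

```python
def aggregate_arrival_rate(rates):
    """Sum the per-pair arrival rates lambda_ij into lambda_tot."""
    return sum(sum(row) for row in rates)


def bus_response_time(lambda_tot, mu):
    """Expected response time of a shared bus modeled as a single-server queue.

    Implements rho_tot / (lambda_tot * (1 - rho_tot)) with rho_tot = lambda_tot / mu.
    The common service rate mu is an assumption of this sketch.
    """
    if lambda_tot <= 0 or mu <= 0:
        raise ValueError("lambda_tot and mu must be positive")
    rho_tot = lambda_tot / mu
    if rho_tot >= 1.0:
        raise ValueError("queue is unstable when rho_tot >= 1")
    return rho_tot / (lambda_tot * (1.0 - rho_tot))


# Hypothetical traffic: eight nodes, every node sends to every other node
# at rate 0.01 requests per time unit (56 source-destination pairs).
rates = [[0.0 if i == j else 0.01 for j in range(8)] for i in range(8)]
lambda_tot = aggregate_arrival_rate(rates)
print(lambda_tot, bus_response_time(lambda_tot, mu=1.0))
```

With these assumed rates the bus runs at a utilization of about 0.56 and the expected response time is about 2.27 service times. Raising the per-pair rate toward 1/56 drives the utilization toward one and the response time grows without bound, which is the saturation behavior the single-server model predicts for a shared bus.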
and ρtot, average number of packets the system can service concurrently. Octagon has significantly higher maxi-
Octagon (b)
Nodes        Horizontal tracks/length (mm)   Vertical tracks/length (mm)   Maximum distance (hops)
8            12/8                            12/0.156                      2
15           24/8                            24/0.156                      4
22 (shown)   36/8                            36/0.156                      6 (shown)

Crossbar (c)
Nodes        Horizontal tracks/length (mm)   Vertical tracks/length (mm)   Maximum distance (hops)
8            8/8                             32/0.108                      1
15           15/16                           120/0.192                     1
22           22/11                           242/0.276                     1

Figure 7. Scaling strategy 1. Bridge nodes (node y) connect adjacent Octagons (a) and perform hierarchical packet routing. Member nodes attach to only one Octagon (node x). The tables give the maximum distance for Octagon (b) and crossbar (c) networks of various sizes.
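The hierarchical routing that bridge and member nodes perform can be illustrated with a small address-decoding sketch. It assumes the 6-bit address layout this section describes for a network of eight Octagons: three high-order bits select the Octagon and three low-order bits select the node within it. The function names are hypothetical, and how a bridge node maps the full field to an outgoing link is not specified in the article, so the sketch only shows which part of the address each node type examines.

```python
OCTAGON_BITS = 3  # high-order bits: which Octagon
NODE_BITS = 3     # low-order bits: which node inside that Octagon
NODE_MASK = (1 << NODE_BITS) - 1


def split_address(addr):
    """Split a 6-bit packet address into (octagon_id, node_id)."""
    if not 0 <= addr < (1 << (OCTAGON_BITS + NODE_BITS)):
        raise ValueError("address must fit in 6 bits")
    return addr >> NODE_BITS, addr & NODE_MASK


def member_route_key(addr):
    """A member node routes on the three low-order bits only."""
    return addr & NODE_MASK


def bridge_route_key(addr):
    """A bridge node routes on the entire field (octagon_id, node_id)."""
    return split_address(addr)


addr = 0b101011  # Octagon 5, node 3
print(split_address(addr), member_route_key(addr), bridge_route_key(addr))
```

Because a member node ignores the high-order bits, its routing logic is identical in every Octagon, which is one reason this strategy keeps per-node complexity constant as Octagons are added.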
loss probability as the size of each node's egress queue increases. Note that a queue size of 25 results in a nominal packet loss of 0.1 percent, while a queue size of 35 reduces the packet loss probability to 0.01 percent. If needed, a system designer can enable a zero packet loss guarantee in Octagon by having the packet scheduler refuse requests if the egress queue is at full or near-full capacity, thereby stalling the requesting node (as many existing buses do).

Scalability
The increasing performance demands of programmable network processors make scalability an important factor in Octagon's design. Next-generation network processors will likely have 16 or more processors and many distributed memory components holding tables for Internet protocol lookup and classification. The need for interconnecting greater numbers of on-chip components in network processors and other SOCs will accelerate in the foreseeable future, increasing the need for scalable, on-chip communication architectures.

Strategy 1: Low wiring complexity
One of the Octagon architecture's strengths is its ability to scale linearly. Figure 7a shows a scaling strategy that requires two different node types: bridge and member. As the name implies, bridge nodes connect adjacent Octagons and perform hierarchical packet routing (for example, node y in Figure 7a). Member nodes attach to only one Octagon (node x, for example). Consider a network consisting of eight interconnected Octagons. The Octagon address field of each packet is 6 bits wide: three high-order bits to identify the local Octagon and three low-order bits to identify the node within the Octagon. Each bridge node performs static routing based on the entire field, and each member node performs routing based only on the three low-order bits.

Figure 7 also shows the estimated wiring cost as a function of increasing network size for the Octagon and crossbar architectures. To arrive at these estimates, we assumed the SOC layout in Figure 2a for each Octagon. We extended the layout to the configuration of Figure 2b for crossbars with more than eight nodes. Octagon scales linearly and the crossbar does not because in the crossbar, every node is wired to every other node. As the number of nodes N grows, the number of required wires is of
Nodes   Links   Maximum distance (hops)
8       12      2
512     2,304   6

Figure 8. Scaling strategy 2. We extend the Octagon to the multidimensional space by linking corresponding nodes in adjacent Octagons according to the Octagon configuration (a). The table (b) indicates the increased wiring complexity (number of links), and the greater maximum distance needed as the number of connected nodes increases.
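The link counts and maximum distances in the Figure 8 table can be reproduced with a short graph search. This is a sketch under a stated assumption: within one Octagon, node i is linked to its two ring neighbors and to the node directly across (i + 1, i - 1, and i + 4, modulo 8), which yields the eight nodes and 12 links of the basic configuration. In the multidimensional case the same pattern is applied independently along each coordinate. The function names are illustrative only.

```python
from collections import deque
from itertools import product

# Assumed Octagon adjacency: ring neighbors (+1, -1) and the opposite node (+4), mod 8.
OFFSETS = (1, 7, 4)


def neighbors(node):
    """Yield every node one link away, applying the Octagon pattern per dimension."""
    for dim, coord in enumerate(node):
        for off in OFFSETS:
            yield node[:dim] + ((coord + off) % 8,) + node[dim + 1:]


def link_count(dims):
    """Count bidirectional links in a dims-dimensional Octagon network."""
    nodes = product(range(8), repeat=dims)
    return sum(len(set(neighbors(n))) for n in nodes) // 2


def max_distance(dims):
    """Largest hop count from one node, found by breadth-first search.

    Every node sees the same link pattern, so searching from the
    all-zero node gives the network-wide maximum.
    """
    start = (0,) * dims
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in neighbors(cur):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return max(dist.values())


for dims in (1, 3):
    print(8 ** dims, link_count(dims), max_distance(dims))
```

Under this assumption each node has three links per dimension, so the link count is 8^d × 3d / 2 and the maximum distance is 2d hops, which matches the 8-node and 512-node rows of the table.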
[Figure: number of links (0 to 4,500) versus number of nodes (1 to 57) for Strategy 1, Strategy 2, and the crossbar.]
Design Automation Conf., ACM Press, New York, 1997, pp. 178-183.
5. R.B. Ortega and G. Borriello, "Communication Synthesis for Distributed Embedded Systems," Proc. Int'l Conf. Computer-Aided Design (ICCAD 98), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 437-444.
6. M. Gasteier and M. Glesner, "Bus-Based Communication Synthesis on System Level," ACM Trans. Design Automation Electronic Systems, vol. 4, no. 1, Jan. 1999, pp. 1-11.
7. K. Lahiri, A. Raghunathan, and S. Dey, "Communication Architecture Tuners: A Methodology for the Design of High-Performance Communication Architectures for System-on-Chips," Proc. 37th Design Automation Conf., ACM Press, New York, 2000, pp. 513-518.
8. K. Lahiri, A. Raghunathan, and S. Dey, "Efficient Exploration of the SOC Communication Architecture Design Space," Proc. Int'l Conf. Computer-Aided Design (ICCAD 00), IEEE CS Press, Los Alamitos, Calif., 2000, pp. 424-430.
9. Sonics Integration Architecture, www.sonicsinc.com.
10. D. Flynn, "AMBA: Enabling Reusable On-Chip Designs," IEEE Micro, vol. 17, no. 4, July-Aug. 1997, pp. 20-27.
11. A. Charlesworth, "The Sun Fireplane Interconnect," IEEE Micro, vol. 22, no. 1, Jan.-Feb. 2002, pp. 36-45.
12. J. Chen and T. Stern, "Throughput Analysis, Optimal Buffer Allocation, and Traffic Imbalance Study of a Generic Nonblocking Packet Switch," IEEE J. Selected Areas in Comm., vol. 9, no. 3, Apr. 1991, pp. 439-449.
13. D. Gross and C. Harris, Fundamentals of Queueing Theory, 3rd ed., John Wiley & Sons, New York, 1998, pp. 297-300.

Faraydon Karim is an ST Fellow at STMicroelectronics' Advanced System Technology, Advanced Computing Lab in La Jolla, California. His research interests include computer and embedded system architecture. Karim has a PhD in computer engineering from La Salle University. He is a member of the IEEE.

Anh Nguyen is a research engineer at the STMicroelectronics Advanced Systems Technology, Advanced Computing Lab in La Jolla, California. His research interests include performance analysis, resource allocation, and algorithm development. Nguyen has a PhD in electrical engineering from the University of California, Los Angeles. He is a member of the IEEE.

Sujit Dey is a professor in the Electrical and Computer Engineering Department at the University of California, San Diego. His research interests include configurable platforms consisting of adaptive wireless protocols and algorithms, and deep-submicron adaptive SOCs for next-generation wireless appliances and network infrastructure devices. Dey has a PhD in computer science from Duke University. He is a member of the IEEE.

Direct questions and comments to Faraydon Karim at STMicroelectronics, Advanced System Technology, San Diego, CA 92121; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://computer.org/publications/dlib.