Fast Scalable FPGA-Based Network-on-Chip Simulation Models
Michael K. Papamichael
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA, USA
Email: [email protected]
I. INTRODUCTION
The objective of the 2011 MEMOCODE Hardware/Software Codesign Contest was to build the fastest
simulator for a class of simple Networks-on-Chip (NoCs)
that precisely replicates the cycle-by-cycle behavior of a
given software reference simulator. Our FPGA-based submission won the Absolute Performance category providing
up to three orders of magnitude speedup over the software
reference design on a Xilinx ML605 FPGA development
board.
The contest reference design supported a large number
of design parameters, which led to a very large design
space consisting of different router configurations, network
topologies and traffic patterns. To effectively cover this vast
design space and at the same time stay within the resource limitations of our FPGA development platform, we
implemented two network simulation designs: i) a high-performance direct-mapped design that laid out the entire
simulated target network on the FPGA and ii) a virtualized
time-multiplexed design used to efficiently simulate larger
network configurations that would not fit using the direct-mapped approach.
This paper describes our contest submission and is organized as follows. Section II describes the problem in more
detail and Section III outlines the design principles we adhered to when developing our contest submission. Section IV
Figure 1. High-level block diagram of platform consisting of Host PC connected to a Xilinx ML605 Development Board. (Block labels: Host PC; MicroBlaze; NoC Simulator (direct-mapped or virtualized); Commands; Results.)
board [3]. The FPGA on the ML605 hosts the NoC simulation engine and a MicroBlaze processor. Both the direct-mapped and virtualized implementations of the NoC simulator expose a common FIFO-based interface for accepting
initialization commands from and streaming out simulation
results to the MicroBlaze. Since the MicroBlaze and NoC
simulator might run at different clock frequencies, the FIFOs
between them are asynchronous to allow crossing between
the two different clock domains.
Running Simulations. To set up the FPGA for a given
NoC configuration, the Bluespec compiler is invoked to generate the Verilog code for the set of parameters specified in
the NoC configuration. The produced Verilog code is then
fed to the Xilinx XST synthesis tool and the resulting netlist
is then connected to the MicroBlaze processor as a peripheral on the PLB bus [4]. Once the FPGA is configured,
scripts are used to convert each traffic pattern to MicroBlaze
code that will initialize the traffic tables for each router in the
network along with other simulation parameters. Since the
traffic tables are allowed to contain hundreds of thousands
of entries, the initialization data can grow very large and is
thus stored in off-chip DRAM.
Once the MicroBlaze has initialized all of the traffic and
routing tables through the Commands FIFO, a final command is sent that triggers the traffic sources and starts the
simulation. The MicroBlaze then polls the Results
FIFO until the NoC simulator detects that the simulation has
terminated, either because the traffic is done or because the
maximum number of cycles has elapsed. At that point,
the number of simulated cycles and other statistics
are enqueued in the Results FIFO and then printed by the
MicroBlaze through the serial port.
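
The command/result handshake described above can be sketched as a small software model. This is only an illustration under stated assumptions: the class `ToySimulator`, the command names `LOAD_TRAFFIC` and `START`, and the one-packet-per-cycle drain rule are hypothetical and are not taken from the contest design or the MicroBlaze code.

```python
from collections import deque


class ToySimulator:
    """Hypothetical stand-in for the NoC simulation engine (not the real design)."""

    def __init__(self, max_cycles):
        self.commands = deque()   # models the Commands FIFO
        self.results = deque()    # models the Results FIFO
        self.max_cycles = max_cycles
        self.pending = 0          # packets loaded into the traffic tables
        self.cycle = 0
        self.running = False

    def step(self):
        """Consume one pending command, or advance one simulated cycle."""
        if self.commands:
            op, arg = self.commands.popleft()
            if op == "LOAD_TRAFFIC":      # assumed command name
                self.pending += arg
            elif op == "START":           # assumed command name
                self.running = True
            return
        if self.running:
            self.cycle += 1
            if self.pending > 0:
                self.pending -= 1         # toy rule: one packet drains per cycle
            # Terminate when traffic is done or the cycle limit is reached.
            if self.pending == 0 or self.cycle >= self.max_cycles:
                self.running = False
                self.results.append(
                    {"cycles": self.cycle, "undelivered": self.pending}
                )


def run(sim, traffic_entries):
    """Initialize the tables, send the start command, then poll for results."""
    for count in traffic_entries:
        sim.commands.append(("LOAD_TRAFFIC", count))
    sim.commands.append(("START", None))
    while not sim.results:                # poll the Results FIFO
        sim.step()
    return sim.results.popleft()
```

For example, `run(ToySimulator(100), [3, 2])` ends because the traffic is done, while `run(ToySimulator(3), [10])` ends because the cycle limit is reached.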
V. ARCHITECTURE AND IMPLEMENTATION
In order to efficiently cover the vast design space of
different possible NoCs and stay within the resource limits of the ML605 FPGA platform, our contest submission consists of two separate NoC simulation engines: i) a
high-performance direct-mapped simulation engine that supports up to moderately sized networks (100 routers) with
medium-complexity routers (e.g., 5 ports with 4 VCs) and ii) a
highly scalable virtualized simulation engine that can handle
the entire design space including the largest network/router
A. Direct-Mapped Implementation.
In a direct-mapped implementation, the network is built as
a collection of router instances that are connected according
to each NoC configuration. Figure 2 shows the architectural
block diagram of a single such router. Each router module
receives flits through a set of input ports and sends flits
through a set of output ports. The first input port of each
router is connected to a traffic source that injects packets
according to a traffic pattern table that is populated during
initialization. Similarly, the first output port is connected to a
traffic sink that drains packets once they have reached their
destination. The remaining input and output ports of each
router are either used to create links with other routers in
the network or may remain unconnected. For each flit link
connecting two neighboring routers there is a corresponding
credit link going in the opposite direction for flow-control.
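
The pairing of a flit link with a reverse credit link can be illustrated with a minimal credit-based flow-control model. This is a sketch under assumptions, not the router logic of the contest design: the class `CreditLink` is hypothetical, it covers a single virtual channel, and credits return immediately rather than after a configurable credit delay.

```python
from collections import deque


class CreditLink:
    """Toy model: one flit link plus its reverse credit link for one VC."""

    def __init__(self, buffer_depth):
        self.depth = buffer_depth
        self.credits = buffer_depth   # sender-side count of free downstream slots
        self.buffer = deque()         # receiver-side flit buffer

    def try_send(self, flit):
        """Sender forwards a flit only if it holds at least one credit."""
        if self.credits == 0:
            return False              # downstream buffer may be full: stall
        self.credits -= 1
        self.buffer.append(flit)
        assert len(self.buffer) <= self.depth   # credits prevent overflow
        return True

    def drain(self):
        """Receiver removes a flit and returns one credit to the sender."""
        if not self.buffer:
            return None
        flit = self.buffer.popleft()
        self.credits += 1             # credit travels opposite to the flit
        return flit
```

With a depth of 2, a third `try_send` fails until `drain` frees a slot, which is the behavior the credit link exists to enforce.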
1 Throughout the rest of this paper the term host will be used to refer to
the system on which the network simulator is executed and the term target
will be used to refer to the network that is being simulated.
Figure 2. (Caption lost in extraction. Block labels: Router; Source with Traffic Table; In Ports: In0, In1 (flits, credits), In4 (flits, credits); Flit Buffers (VC 0, VC 1); Routing; Arbitration; Switch; Out Ports: Out0, Out1 (flits, credits), Out4 (flits, credits); Sink; 16 BRAMs.)
Figure 3. (Caption truncated in extraction; surviving text ends "engine." Block labels: Virtualized Router; Router Logic; Router State; Virtual Sources; Traffic Table; Flit Buffers; Credits; Route Tables; Scheduler State; Virtual Links; Flit Links; Credit Links; Delay; Flit/Credit Conn. Table.)
                    2 VCs         4 VCs         8 VCs
[4 in/out ports]    3050 / 66     4117 / 56     6346 / 34
8 in/out ports      7912 / 35     11833 / 28    28859 / 17
12 in/out ports     13653 / 30    28461 / 16    48081 / 10
16 in/out ports     30399 / 17    52288 / 12    101500 / 7
Table II
VI. RESULTS
To first get a sense of how the two presented NoC simulator implementations scale in terms of FPGA resource usage and clock frequency, we present FPGA synthesis results
for both the direct-mapped and virtualized simulator implementations. We then show more detailed results for the five
specific networks that were used in the contest validation.
Finally, we present a brief case study that looks
at one of these five networks in more depth.
Direct-mapped Implementation Results. As mentioned
earlier, the direct-mapped implementation of our NoC simulator is a collection of interconnected router modules. Table I
shows FPGA resource usage and clock frequency synthesis
results for different router configurations targeting a Xilinx
Virtex-6 LX760T FPGA. All reported results are for a single router within a 256-node network, the largest network
allowed in the contest. As expected, increasing the number
of router ports and VCs leads to higher LUT counts and
negatively impacts clock frequency.
LUTs / Clock Frequency (in MHz)
Router Config.      2 VCs         4 VCs         8 VCs
4 in/out ports      785 / 152     1393 / 101    2848 / 59
8 in/out ports      3243 / 81     6134 / 54     12754 / 33
12 in/out ports     7717 / 62     11596 / 36    19198 / 20
16 in/out ports     11655 / 45    28294 / 30    33689 / 14
Table I
SYNTHESIS RESULTS FOR SINGLE ROUTER IN DIRECT-MAPPED DESIGN.
Routers
Ports/Router
VCs
butterfly
112
Credit Delay
1
highradix
16
16
15
mesh
253
torus
252
hypercube
256
Table III
CONFIGURATION OF CONTEST NETWORKS.
Xilinx LX760T
Network
DM/V
%LUTs
Speedup
DM/V
%LUTs
Speedup
butterfly
DM
86%
1511
DM
27%
2330
highradix
63%
DM
93%
421
mesh
3%
28
DM
96%
4281
torus
8%
7862
2%
7892
hypercube
8%
21
2%
33
FPGA, an interesting extension to this work would be building a flexible configurable NoC generator. Such a tool could
prove useful to FPGA designers that need an FPGA-friendly
NoC that is custom-built to meet the specific needs of their
application. In fact, a heavily modified version of the direct-mapped NoC code base is currently used as the interconnect
within the CoRAM project [8].
VIII. ACKNOWLEDGMENTS
Table IV
IMPLEMENTATION RESULTS FOR CONTEST NETWORKS.
VII. DISCUSSION
REFERENCES
Multiple Virtualized Routers. Even though our virtualized simulation engine can scale to very large network and
router configurations, this scalability comes at the cost of
lower performance compared to the direct-mapped approach.
To bridge this gap, one idea for future work is to use multiple
virtualized routers that run concurrently. To maintain proper
event ordering in such a setting, the system needs to ensure
that only independent (i.e. not neighboring) sets of routers
are simulated at the same time. This issue has been studied
in previous work [5] and a straightforward way to resolve
it would be through a separate preprocessing step that identifies independent sets of routers in the network and then
generates a fixed valid simulation schedule.
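
One possible form of such a preprocessing step is sketched below. The paragraph above does not specify an algorithm, so the greedy grouping used here is an assumption chosen for brevity, and the function name `independent_schedule` and its input format are hypothetical. The adjacency map is assumed to be symmetric (if router a lists b, then b lists a).

```python
def independent_schedule(neighbors):
    """Greedily partition routers into sets with no two neighbors in the same set.

    neighbors: dict mapping each router id to the set of its neighboring
    router ids (assumed symmetric). Returns a list of sets; the routers in
    one set could be simulated concurrently, and the sets are visited in order.
    """
    slots = []
    for router in sorted(neighbors):
        for slot in slots:
            # Join the first slot that contains none of this router's neighbors.
            if not (neighbors[router] & slot):
                slot.add(router)
                break
        else:
            slots.append({router})    # no compatible slot: open a new one
    return slots
```

For a 2x2 mesh with links 0-1, 0-2, 1-3 and 2-3, this yields two sets, `{0, 3}` and `{1, 2}`, so two virtualized routers could work in parallel in each step. A greedy pass does not guarantee the minimum number of sets.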
An Alternative Approach to FPGA-based NoC simulation. Another interesting approach to FPGA-friendly NoC
simulation is FIST [7], a simulation technique previously
explored by our group that abstractly models each router
as a set of load-delay curves, which are obtained through
training using a software-based cycle-accurate NoC simulator. In addition to high simulation speed and scalability, an
important benefit of such an approach is reduced implementation complexity. In contrast to the two NoC simulation
approaches presented in this paper, FIST does not require
implementing the actual router in hardware; instead it relies
on the presence of a software-based model that will be used
for training purposes.
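
To make the idea of a load-delay curve concrete, the sketch below looks up a delay for a given load by linear interpolation between sampled points. It illustrates the general concept only: the function `delay_from_curve`, the interpolation scheme, and the sample values are assumptions and are not taken from FIST [7].

```python
from bisect import bisect_right


def delay_from_curve(curve, load):
    """Return the delay for `load` from a sampled load-delay curve.

    curve: list of (load, delay) points sorted by strictly increasing load.
    Values between samples are linearly interpolated; values outside the
    sampled range are clamped to the nearest end point.
    """
    loads = [point[0] for point in curve]
    if load <= loads[0]:
        return curve[0][1]
    if load >= loads[-1]:
        return curve[-1][1]
    i = bisect_right(loads, load)     # index of first sample with larger load
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (load - x0) / (x1 - x0)


# Illustrative points only, not measured data.
example_curve = [(0.0, 4.0), (0.5, 6.0), (0.9, 20.0)]
```

In a trained model, the sample points would come from the software-based cycle-accurate simulator rather than being written by hand.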
Automatic Network Generation. Given that the direct-mapped design is already fully parameterized and essentially
builds a working prototype of the target network on the