Exploring_and_optimizing_partitioning_of_large_des
https://doi.org/10.1007/s00607-020-00834-5
REGULAR PAPER
Received: 18 January 2020 / Accepted: 10 July 2020 / Published online: 21 July 2020
© Springer-Verlag GmbH Austria, part of Springer Nature 2020
Abstract
Recently, multi-FPGA platforms have become a popular choice to prototype complex
digital systems. This is because of unique advantages such as high frequency and real
world testing experience that are offered when compared to other pre-silicon testing
techniques. However, one of several challenges faced by multi-FPGA prototyping is
the requirement of an efficient back end flow. Partitioning is a key part of the back
end flow of multi-FPGA systems and it directly affects the quality of the final prototyped
design. In this work, we explore two different partitioning approaches: a multilevel
approach and a hierarchical approach. For experimentation, we
use a suite of fourteen large benchmarks. Experimental results reveal that the multilevel
approach gives 12.5% better frequency results for mono-cluster benchmarks
while the hierarchical approach gives 13% better results for multi-cluster benchmarks.
Furthermore, the hierarchical approach requires, on average, 60% less execution time
when compared to the multilevel partitioning approach.
1 Introduction
Modern day System on Chip (SoC) designs have huge computation capability and they
are enormously complex to design. Moreover, shrinking product life cycle and faster
time-to-market pressures increase the need for an efficient, fault-free design process
B Umer Farooq
[email protected]
Bander A. Alzahrani
[email protected]
123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
2362 U. Farooq, B. A. Alzahrani
[1,2], because a faulty or inefficient design can be extremely costly [3,4]. In this
regard, FPGA-based prototyping offers a good option for complete design-to-silicon
system verification. FPGA-based prototyping is a pre-silicon verification technique that
offers better speed than simulation-based verification [5]. Simulation-based
solutions are cost-effective but they are very slow and offer only an abstract-level view
of the system. Although emulation-based pre-silicon verification gives good speed, a
unique feature of FPGA-based prototyping is that it gives a real-world testing and
troubleshooting experience to the user.
Prototyping of a less complex Application Specific Integrated Circuit (ASIC) can be
performed on a single FPGA, as modern FPGAs are quite capable and have huge
logic capacity. However, as the complexity of the system under consideration grows,
the capability of even the most modern FPGAs becomes insufficient to handle the
resource and I/O requirements of the ASIC. For such scenarios, multi-FPGA platforms
are required because the gap between FPGA capability and ASIC requirements is huge
[6] and, with every new processing technology, it is becoming increasingly difficult
to bridge this gap. Normally, the number of FPGAs required to prototype a design
depends upon the complexity of the design under consideration and this number may
vary from a few FPGAs to a couple of dozen FPGAs [7,8]. The prototyping of complex
ASIC designs using multi-FPGA platforms usually follows a complex back end flow
that involves several optimization steps. The core objective of this back end flow is
to optimize the frequency and the execution speed of the design under consideration.
The back end flow starts with the RTL description of the design. The design is first
synthesized and then partitioned using a partitioning algorithm. After partitioning, the
inter-FPGA routing of the design is performed. Finally, the flow culminates in the intra-FPGA
placement and routing of the design.
Partitioning is one of the most critical steps of the multi-FPGA prototyping flow.
In this step, based on the number of FPGAs on the multi-FPGA board, the design under
consideration is divided into multiple parts. Because of the several optimization
constraints, finding an optimal partitioning solution is an NP-hard problem [9]. When we
consider the partitioning problem from the multi-FPGA prototyping perspective, several
constraints are associated with the partitioning of a complex design. Two of the principal
objectives of a partitioning tool are to respect the logic capacity of the target FPGA
architecture while keeping the communication between different partitions as small
as possible. Thanks to the improved design process and better processing technology,
both the logic capacity and the number of I/Os of modern generations of FPGAs have
increased. However, the rate at which the logic capacity of FPGAs has increased is
much higher than the rate of increase in the number of I/Os. This trend has
led to an increased logic-to-I/O ratio in newer generations of FPGAs and it has become
particularly difficult for a partitioning tool to minimize the inter-partition
communication. Thus, the number of signals (also termed cut-nets) traversing different
partitions is greater than the number of available I/Os between different FPGAs of a
multi-FPGA board. These signals are routed between different FPGAs in the next step of the
prototyping flow, which is called inter-FPGA routing.
Inter-FPGA routing follows the partitioning of the design under consideration and
also plays an important role in its overall optimization.
In this step, the cut-nets of the partitioned design are routed on the tracks of
the multi-FPGA board in a time division multiplexed (TDM) manner. So, a larger number
of cut-nets leads to a higher multiplexing ratio, which in turn reduces
the execution speed of the final prototyped design. The results produced by the routing
tool are directly linked with the quality of the preceding partitioning process. Even a
highly efficient routing tool cannot overturn the poor results of a partitioning tool. An
in-depth discussion on the quality of the partitioning tool and its impact on the frequency
of the final prototyped design is presented in the subsequent sections of the paper.
It is evident from the discussion presented above that partitioning plays a very important
role in the multi-FPGA prototyping flow. In this work, we propose and explore
two partitioning approaches, namely the hierarchical and the multilevel partitioning approaches.
For this purpose, we use an open source flow. This flow covers the complete
prototyping process for multi-FPGA systems. However, our focus in this work remains
on the partitioning aspect of the flow. The flow proposed in this work starts with
the generation of large and complex benchmarks. These benchmarks are generated
using a generic academic tool that can generate both flat and hierarchical benchmarks.
The generated benchmarks are next logically synthesized using an open source tool
by VERIFIC [10]. This tool not only performs standard cell synthesis but also gives
complete information about the interconnect of the design under consideration. After
synthesis, we perform partitioning of the design. In order to produce the best partitioning
results, we strive to exploit the inherent interconnect patterns of the design under
consideration. For this purpose, we explore two different partitioning approaches. The first
approach exploits the hierarchical interconnect which is inherent in certain designs. We
call it the hierarchical partitioning approach. The second partitioning
approach performs partitioning using multilevel clustering and
refinement. We call it the multilevel partitioning approach. Both
proposed approaches are novel in the sense that they have been specifically customized
in the context of prototyping for multi-FPGA systems. An in-depth discussion on both
approaches is given in Sect. 4 of the paper. After partitioning, the inter-FPGA routing
of the design is performed. After routing, system frequency results are obtained for the
two partitioning approaches and a thorough analysis of those results is also presented.
Here, only a brief overview of different steps involved in the proposed back end flow
is given. Detailed discussion on these steps is given in the subsequent sections of the
paper. The main contributions of this paper are summarized as follows:
– Development of an open source, generic, back end flow for prototyping of multi-
FPGA systems. All the steps of the proposed flow either use open source tools or
the tools that are free for academia.
– Development and implementation of two partitioning approaches for exploration
and optimization of different designs in multi-FPGA prototyping.
– Extensive experimentation and thorough analysis of results obtained through the
proposed back end flow.
In the rest of the paper, Sect. 2 discusses the background and related work and also
elaborates the contribution of this work. Section 3 then gives a detailed discussion of
the proposed flow, where comprehensive details of all the steps of the back end flow are
presented. Section 4 presents an in-depth discussion of the two partitioning approaches that
we propose and explore in this work. Section 5 gives a detailed analysis of the results
obtained through experimentation and Sect. 6 concludes this paper with discussion on
the future work.
2 Background and related work
The discussion presented in Sect. 1 shows that partitioning plays a very important
role in determining the system frequency of the final prototyped design. Partitioning is a
well formulated research problem and researchers have been active in this area since
the 1970s. Many techniques have been proposed in the past to find efficient solutions to
partitioning problems. Mainly, there are three different types of techniques which are
used to solve a partitioning problem.
1. The analytical partitioning technique [11,12] is commonly utilized where the objective
function is to optimize the quadratic length of the critical path. Although minimizing
the quadratic length of the critical path is only an indirect measure of
the partitioning solution, its main advantage is that the objective function can be
optimized in very little time. This kind of approach is particularly suitable for very
large problems. A quadratic function, however, does not give the best possible
solution and it is often followed by several local tweaks.
2. Simulated annealing based placement [13,14] is another technique. It borrows the
concept of annealing, in which molten metal is cooled down gradually, to produce
high quality solutions. The objective function of this approach is to minimize the
overall Manhattan distance between all the connected instances. This approach
is quite effective in finding a reasonably good solution in a small amount of time.
This type of technique is commonly used for island style architectures. However,
simulated annealing is classified as a placement technique rather
than a partitioning technique.
3. The min-cut based partitioning approach [15,16] is generally suitable for partitioning
complex designs. The min-cut partitioner recursively partitions the design
under consideration. The aim of the partitioner is to minimize the cut-nets of the
design by merging the connected instances in a single cluster. Because of its
ability to find a good solution in little time, in this work, we mainly consider
min-cut based partitioning algorithms. Further discussion on different min-cut
based partitioning algorithms is given next.
In the min-cut based partitioning approach, the design is represented as a hypergraph and
the connections between different instances of the design are represented as hyperedges.
The main objective of the partitioner is to minimize the number of cut hyperedges
(connections that traverse more than one partition) in the graph. In this regard, the authors in
[17] present the Kernighan-Lin bi-partitioning algorithm. The authors in [18] present the FM
partitioning algorithm that uses a recursive bi-partitioning approach to find a solution
to a partitioning problem. Similarly, the authors in [19] present another bi-partitioning
algorithm that promises to give optimal results for small graphs. However, this algorithm
gives either sub-optimal or no results for large to very large hypergraphs. The
aforementioned three algorithms are the main partitioning algorithms used for digital
systems and the research work done later is mainly an extension of one of these
algorithms. Among these algorithms, the Fiduccia-Mattheyses (FM) heuristic [18] is known
to produce the best results. It is an iterative partitioning algorithm that minimizes the
cut-net count over multiple iterations. In each iteration, the cut-net cost is reduced
by maximizing the move gain that is associated with each move of an instance from
one cluster to another. In the FM algorithm, a move can have either a positive or a negative
gain. After each move, the gains of all the associated instances are also updated. This
keeps the complexity of the overall process linear and allows a good solution to be found
in minimal time.
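The move gain used by FM can be illustrated with a small sketch. The code below is not taken from any of the cited tools; the netlist, the function names `cut_size` and `fm_gain`, and the two-sided `side` dictionary are assumptions made for illustration. It computes the gain of every instance of a tiny hypergraph and applies the single best move. A full FM pass additionally locks moved instances, enforces a balance constraint and updates gains incrementally, all of which are omitted here.

```python
def cut_size(nets, side):
    """Count nets whose instances lie on both sides of the bi-partition."""
    return sum(1 for net in nets if len({side[v] for v in net}) > 1)


def fm_gain(nets, side, v):
    """Reduction in cut size obtained by moving instance v to the other side."""
    gain = 0
    for net in nets:
        if v not in net or len(net) < 2:
            continue
        same = sum(1 for u in net if side[u] == side[v])  # includes v itself
        if same == 1:
            gain += 1      # v is alone on its side: the net stops being cut
        elif same == len(net):
            gain -= 1      # net is entirely on v's side: the move cuts it
    return gain


# Hypothetical four-instance netlist; each tuple is one net (hyperedge).
nets = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
side = {"a": 0, "c": 0, "b": 1, "d": 1}

gains = {v: fm_gain(nets, side, v) for v in side}
best = max(gains, key=gains.get)          # instance with the highest gain
before = cut_size(nets, side)
side[best] = 1 - side[best]               # apply the single best move
after = cut_size(nets, side)
print(best, gains[best], before, after)
```

In this example the cut size drops by exactly the gain of the moved instance, which is the property that lets FM rank candidate moves without re-evaluating the whole hypergraph.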
The min-cut based partitioning approach can be applied either in a flat manner, which
finds a quick solution, or using a multilevel approach. However, the computation time of the flat
partitioning approach increases exponentially with the complexity
of the design. For designs having moderate to high complexity, the multilevel hypergraph
partitioning approach is known to produce the best results [20–22]. The multilevel
partitioning approach comprises three phases, namely clustering, top-level partitioning
and uncoarsening. The main advantage of multilevel partitioning over flat
partitioners is its ability to search the solution space more effectively by spending
comparatively more effort on smaller coarsened hypergraphs. Good coarsening algorithms
allow for high correlation between good partitioning for coarsened hypergraphs
and better refinement for the initial hypergraph. Therefore, a thorough search at the
top of the multilevel hierarchy is worthwhile because it is relatively inexpensive when
compared to flat partitioning of the original hypergraph, but can still preserve most of
the possible improvement. The result is an algorithmic framework with both improved
run time and solution quality over a completely flat approach. The multilevel partitioning
approach was successfully demonstrated by the hMetis program [22]. This tool mainly
uses the FM algorithm for partitioning and it also introduced several new heuristics that
reportedly improved the quality of results. However, this tool partitions designs
with homogeneous instances only and cannot partition heterogeneous instances.
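A single clustering level of the multilevel scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the coarsening scheme of hMetis: instances are merged in pairs, each with the free neighbour that shares the most nets with it, and the function name `coarsen_once` and the example netlist are hypothetical. Repeating the step on its own output yields successively smaller hypergraphs.

```python
from collections import defaultdict


def coarsen_once(nodes, nets):
    """One clustering level: pair each node with its most connected free neighbour."""
    shared = defaultdict(lambda: defaultdict(int))   # shared[u][v] = nets containing both
    for net in nets:
        for u in net:
            for v in net:
                if u != v:
                    shared[u][v] += 1

    cluster_of = {}
    next_id = 0
    for u in nodes:
        if u in cluster_of:
            continue
        cluster_of[u] = next_id
        free = [(v, n) for v, n in shared[u].items() if v not in cluster_of]
        if free:
            # most shared nets first, then name order so the result is deterministic
            partner = min(free, key=lambda item: (-item[1], item[0]))[0]
            cluster_of[partner] = next_id
        next_id += 1

    # Rewrite nets over clusters and drop those absorbed inside a single cluster.
    coarse_nets = []
    for net in nets:
        clusters = tuple(sorted({cluster_of[v] for v in net}))
        if len(clusters) > 1:
            coarse_nets.append(clusters)
    return cluster_of, coarse_nets


# Hypothetical six-instance hypergraph.
nodes = ["a", "b", "c", "d", "e", "f"]
nets = [("a", "b"), ("a", "b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("e", "f")]
cluster_of, coarse_nets = coarsen_once(nodes, nets)
print(cluster_of, coarse_nets)
```

Nets that end up inside one cluster disappear from the coarse hypergraph, which is why the top-level partitioner can afford a more thorough search.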
Above, we have presented a detailed discussion on the partitioning problem from
a generic perspective. When we look at partitioning solutions in the context of
multi-FPGA prototyping, different tools and works exist commercially as well as in academia.
Commercially, different tools exist which provide either a partial or a complete prototyping
flow for multi-FPGA systems. For example, Synopsys’ Protocompiler [23] gives
a complete back end prototyping flow for multi-FPGA systems. However, this tool
is accompanied by the HAPS [24] hardware platform of Synopsys and works only for
Synopsys specific platforms. Then, there are the AUSPY and WASGA [25] partitioning
tools. These tools are platform independent and are not accompanied by a specific
hardware platform. However, these tools give only a partial partitioning solution and
do not provide a complete prototyping flow. Another partitioning tool by Synopsys,
called CERTIFY [26], was available for partitioning of multi-FPGA
systems until recently. It was generic in nature and could be used with any hardware
platform. However, it was recently discontinued and replaced by Protocompiler.
Apart from the aforementioned tools, there are several other solutions [27–29] that are
provided by commercial vendors. Just like multi-FPGA prototyping, these solutions
are mainly used for pre-silicon verification. However, these solutions are either simulation
or emulation based. Moreover, these solutions are very costly and they do not
fall in the domain of this paper. The discussion on the aforementioned commercially available
partitioning solutions indicates that these solutions are either platform dependent
or offer only a partial solution. Moreover, all of them are proprietary tools with
thousands of dollars in annual subscription fees.
On the other hand, if we look at state-of-the-art academic partitioning solutions
from the multi-FPGA prototyping perspective, sufficient work is not available. The authors
in [30,31] propose a new multilevel hierarchical FPGA architecture and they propose
to use a multilevel partitioning tool for the partitioning of the design. However, their
proposed solution can handle only homogeneous blocks and gives a partitioning solution for
a single FPGA only. Similarly, the authors in [32] explore the partitioning problem for
multi-FPGA systems. They only compare solutions obtained through the
commercial WASGA and CERTIFY partitioning tools and do not give any academic
solution. Also, the authors in [33,34] explore prototyping of multi-FPGA systems.
However, for partitioning, they use the commercial tool CERTIFY [26] by Synopsys.
The main focus of their flow remains the inter-FPGA routing issue of the back end
flow. Furthermore, the authors in [35,36] also address the back end flow for multi-FPGA
systems, but their focus remains mainly on inter-FPGA routing as well.
In this work, we not only address the routing issue but we also focus on the partitioning
problem, because even a highly efficient routing tool cannot improve the
frequency of the final prototyped design if it is preceded by an inefficient partitioning process.
In order to make the partitioning process efficient, we put particular emphasis on
the knowledge of the interconnect of the design under consideration. We extract the information
on the interconnect of the design through the open source tool VERIFIC
[10]. This matters because different types of designs exhibit different interconnect
patterns. Some of them are hierarchical in nature while others have a rather flat
interconnect. So, partitioning all the designs with a single approach is not justified
and it may eventually lead to poor frequency results. For this reason, in our back end
flow, we propose and explore two different partitioning approaches. The first
approach is called the hierarchical partitioning approach and it uses a hierarchical partitioning
algorithm. This approach is more useful for designs exhibiting hierarchical
interconnect. The second approach is based on a multilevel partitioning algorithm and it is
more suitable for rather flat designs. Details about the two proposed approaches are
given in Sect. 4. The two partitioning approaches coupled with an efficient inter-FPGA
routing tool give the best frequency results for the partitioned design.
To the best of our knowledge, there is not enough academic work in the state of the
art on multi-FPGA prototyping systems from the partitioning perspective. As discussed
before, some work exists that either uses commercial tools or performs a comparison
between the partitioning results of commercial tools. The unique contribution of this work
is that we extract the information on the interconnect of the design through the VERIFIC
tool, which is free for academia. Next, we apply whichever of the two partitioning approaches
best exploits the interconnect in terms of minimizing the cut-nets of the partitioned
design. Both the proposed partitioning approaches used in this work are based either
on academic tools or on customized versions of those tools. So, through this work,
we strive to provide a platform for academia in multi-FPGA prototyping and advance
the research in the important domain of pre-silicon verification through multi-FPGA
prototyping.
3 Prototyping flow
In this paper, we propose a prototyping flow for multi-FPGA based systems. In this
flow, we explore two different partitioning approaches and analyze their effect on
the system frequency of final prototyped design. An overview of the complete flow
is shown in Fig. 1. It can be seen from this figure that the flow starts with the logic
synthesis of the benchmark under consideration. After passing through various steps,
the flow terminates at the bitstream generation of the design. Further discussion on
the steps of the flow is given next.
3.1 Benchmark generation
For any exploration flow, benchmarks are a fundamental requirement. For a multi-FPGA
prototyping flow, this requirement is even more pertinent, as complex benchmarks
mimicking real life applications are essential to test the capability of the
tools of a prototyping flow. Researchers in the past [37–39] have used different sets of
benchmarks for different types of exploration environments. But these benchmarks are
either too small to pose a real challenge to the exploration tools or they are synthetic
in nature and lack resemblance to real life applications. In this work, we use benchmarks
that are generated by the DSX [40] academic tool. Using this tool, we can generate
mono-core and multi-core MPSoC architectures. A mono-core MPSoC architecture
contains components like UART, RAM, multiple FIFOs, and co-processors. These
components are further connected with each other through a crossbar architecture.
An example of a mono-core MPSoC architecture is shown in Fig. 2. In a multi-core
MPSoC architecture, we have clusters of mono-core MPSoCs that are connected to
each other using a mesh-based NoC interconnect [41]. As compared to multi-core
MPSoCs, mono-core MPSoCs have lower complexity and a flat bus-based interconnect.
Multi-core MPSoCs, on the other hand, have higher complexity and a hierarchical
interconnect. An example of a multi-core MPSoC architecture is shown in Fig. 3.
3.2 Synthesis
It can be seen from Fig. 1 that the benchmarks generated through the DSX tool are
first logically synthesized. During synthesis, the design is logically optimized. For
logic synthesis, in this work, we use an open source tool by VERIFIC [10] which is
free for non-commercial academic purposes. When the benchmark is given to this
tool, it parses the whole design. The parser of this
tool builds a comprehensive database of all the components of the design and gives
complete information about the interconnect of different components of the design.
This information is very useful as it is used by the hierarchical partitioner later in the
flow. The tool also transforms the design into the standard logic gate
format. We use this tool to keep our flow open source and generic in nature.
3.3 Partitioning
After synthesis, the partitioning of the design under consideration is performed. Since
the designs are quite large and complex, a single FPGA cannot satisfy their logic and
I/O resource requirements; thus, they have to be divided into multiple partitions. As
discussed in Sect. 1, partitioning plays a very important role in the final execution
speed of the design under consideration. Normally, the number of physical connections
between different partitions is quite small while the number of cut-nets that span these
partitions is quite large. So, in the subsequent process, these cut-nets have to share the
physical resources between different FPGAs in a time multiplexed manner. Eventually,
a larger number of cut-nets leads to a larger multiplexer, hence increasing the delay
and reducing the overall speed. Thus the main goal of any partitioner is to keep the
Fig. 4 a Partitioning solution with 2 cut-nets; b partitioning solution with 1 cut-net
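The comparison made in Fig. 4 can be reproduced numerically with a short sketch. The netlist below is hypothetical and is not the circuit drawn in the figure; the function name `evaluate_partition`, the unit instance sizes and the capacity of three instances per partition are likewise assumptions. For a candidate assignment, the sketch reports which nets are cut, the logic load of each partition, and whether the capacity constraint is respected.

```python
from collections import Counter


def evaluate_partition(nets, part_of, size, capacity):
    """Return (cut-net names, per-partition load, capacity respected?)."""
    cut = sorted(name for name, pins in nets.items()
                 if len({part_of[p] for p in pins}) > 1)
    load = Counter()
    for inst, part in part_of.items():
        load[part] += size[inst]
    feasible = all(used <= capacity for used in load.values())
    return cut, dict(load), feasible


# Hypothetical design: six unit-size instances connected by three nets.
nets = {"n1": ("A", "B", "C"), "n2": ("C", "D"), "n3": ("D", "E", "F")}
size = {inst: 1 for inst in "ABCDEF"}

solution_a = {"A": 0, "B": 0, "F": 0, "C": 1, "D": 1, "E": 1}
solution_b = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}

result_a = evaluate_partition(nets, solution_a, size, capacity=3)
result_b = evaluate_partition(nets, solution_b, size, capacity=3)
print(result_a)
print(result_b)
```

Both assignments are balanced and respect the capacity, yet the first cuts two nets and the second only one, so the second would need fewer inter-FPGA signals.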
3.4 Routing
Once the partitioning is completed, the routing of the design under consideration is
performed on the multi-FPGA board. The aim of the partitioning approaches discussed in
Sect. 3.3 is to minimize the number of cut-nets. However, as discussed in Sect. 1, the
number of cut-nets is always greater than the available I/O resources of the FPGAs. This
is because of the higher logic capacity and fewer I/Os of newer generations of FPGAs.
Therefore, we have to route the cut-nets in a time division multiplexed manner. A
simplified overview of the flow used to perform inter-FPGA routing is shown in Fig. 5.
It can be seen from this figure that the routing flow starts with the routing constraints, the board
description (user generated) and the trace assignment file (generated by partitioning).
Once these files are given to the routing tool, the routing graph is generated. The
I/Os are represented in this routing graph as a set of vertices V and the connections
between these vertices are represented as edges E. Together, these vertices and edges
form a directed graph G(V, E). The graph is later used by the
routing algorithm to route cut-nets on the physical resources of the FPGA board. Once
the routing graph is generated, the initial mux ratio is computed as the ratio of the maximum
number of cut-nets to the number of physical wires between two partitions. Next, the cut-nets
are grouped as per the mux ratio value and routing is performed. For inter-FPGA
routing, in this work, we use the Pathfinder [42] routing algorithm. This is a congestion-driven,
negotiation-based routing algorithm. The Pathfinder routing algorithm routes the
cut-nets one by one and tries to find a conflict free solution through a negotiation based
approach. For a conflict free solution, it uses an iterative approach through which the
cost of congested nodes is gradually increased to avoid congestion in future iterations. This
algorithm routes all the cut-nets of the design in a conflict free manner. Next, the mux
ratio is optimized using a binary search algorithm. Each time a successful routing is
achieved, the mux ratio is adjusted according to the binary search algorithm. The binary
search continues until the best mux ratio is found. This process is also
depicted in Fig. 5. While searching for the minimum mux ratio, the routing algorithm
also tries to keep the number of hops as small as possible. This is because
both the mux ratio and the number of hops affect the final system frequency,
which is computed at the end of the routing process.
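The mux ratio computation and its binary search refinement can be sketched as follows. This is an illustration under stated assumptions rather than the routing tool used in this work: the ratio is rounded up to an integer, the search bounds are supplied by the caller, and `routes_ok` is a hypothetical callback that stands in for one complete Pathfinder run at a given mux ratio and is assumed to be monotone (if a ratio routes, every larger ratio routes as well).

```python
import math


def initial_mux_ratio(cut_nets, wires):
    """Largest cut-nets-to-wires ratio over all FPGA pairs, rounded up."""
    return max(math.ceil(cut_nets[pair] / wires[pair]) for pair in cut_nets)


def minimise_mux_ratio(lower, upper, routes_ok):
    """Smallest ratio in [lower, upper] accepted by routes_ok.

    Assumes routes_ok(upper) is True and that feasibility is monotone.
    """
    while lower < upper:
        mid = (lower + upper) // 2
        if routes_ok(mid):
            upper = mid          # routing succeeded: try a smaller ratio
        else:
            lower = mid + 1      # routing failed: more multiplexing is needed
    return lower


# Hypothetical board: cut-nets and physical wires per FPGA pair.
cut_nets = {("F0", "F1"): 250, ("F1", "F2"): 90}
wires = {("F0", "F1"): 100, ("F1", "F2"): 100}

start = initial_mux_ratio(cut_nets, wires)
# Stand-in for the router: pretend routing only succeeds from ratio 5 upwards,
# for example because multi-hop nets consume extra wires.
best = minimise_mux_ratio(start, 16, lambda ratio: ratio >= 5)
print(start, best)
```

The initial ratio serves as the lower end of the search because no routing can succeed with less multiplexing than the most congested FPGA pair requires.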
Once the routing is complete, the netlists are generated as shown in Fig. 1. These
netlists contain all the information related to the partitioned design and its routing.
The netlists are next passed to the vendor specific tool to perform
intra-FPGA synthesis, placement, and routing of all the partitions. After the successful
completion of this step, the bitstreams of the partitions are generated, which can finally
be loaded into the respective FPGAs to complete the prototyping flow. Loading
the bitstreams makes it possible to perform in-circuit verification and debugging
of the partitioned design. Moreover, it also gives real world, cycle-accurate and
bit-accurate execution information of the partitioned design.
A comprehensive overview of all the steps of prototyping flow is given in this sec-
tion. In the next section, a further detailed discussion is provided on the two partitioning
approaches that are proposed and explored in this work.
4 Partitioning approaches
As discussed in Sects. 1 and 3.3, partitioning plays a fundamental role in the quality
of the final prototyped design. It is at this step that the number of cut-nets of a partitioned
design is determined. A cut-net can either connect a single source to a single
destination (bi-terminal cut-net) or a single source to multiple destinations
(multi-terminal cut-net). These cut-nets later determine the mux ratio, which eventually
decides the execution speed of the design. It is evident that there is a direct
relation between the number of cut-nets obtained after partitioning and the execution
frequency of the final design. The primary objective of any partitioner is to minimize
the number of cut-nets while also satisfying the logic resource constraint of the target
architecture. In order to best satisfy these constraints, in this work, we explore two
different partitioning approaches; one is termed the hierarchical partitioning approach
while the other is called the multilevel partitioning approach. A detailed discussion of the two
partitioning approaches is provided next.
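The distinction between bi-terminal and multi-terminal cut-nets can be made concrete with a small sketch. It assumes that destinations are counted at the partition (FPGA) level, so a net whose sinks all lie in one remote partition is treated as bi-terminal; this interpretation, the function name `classify_cut_nets` and the example netlist are assumptions made for illustration.

```python
def classify_cut_nets(nets, part_of):
    """Label each cut-net as 'bi-terminal' or 'multi-terminal'.

    nets maps a net name to (source instance, list of sink instances).
    Nets that stay inside one partition are not cut-nets and are skipped.
    """
    kinds = {}
    for name, (source, sinks) in nets.items():
        remote = {part_of[s] for s in sinks} - {part_of[source]}
        if not remote:
            continue  # net stays inside the source partition
        kinds[name] = "bi-terminal" if len(remote) == 1 else "multi-terminal"
    return kinds


# Hypothetical netlist spread over three partitions.
nets = {
    "n1": ("a", ["b"]),        # a and b share partition 0: not a cut-net
    "n2": ("a", ["c"]),        # partition 0 -> partition 1
    "n3": ("b", ["c", "d"]),   # partition 0 -> partitions 1 and 2
}
part_of = {"a": 0, "b": 0, "c": 1, "d": 2}
kinds = classify_cut_nets(nets, part_of)
print(kinds)
```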
4.1 Hierarchical partitioning approach
It is discussed in Sect. 3.2 that we use VERIFIC to perform logic synthesis of the
design under consideration. While performing logic synthesis, VERIFIC parses the
whole design and gives complete information about the interconnect of the design.
In the hierarchical partitioning approach, we extract information about the hierarchy of
the design from the VERIFIC parser. In the next step, based on the required number
of partitions and the specified capacity of each partition, a hierarchical partitioning
algorithm is applied to the design. The flow of this algorithm is given in Fig. 6. It can
be seen from this figure that initially all the instances of the synthesized design are
marked as unassigned. Next, on the basis of connectivity, these instances are assigned
to different partitions iteratively. The partitioning algorithm adopts a top-down partitioning
approach. In each iteration, N unassigned instances are chosen. Then, based
on the hierarchical information, these instances are assigned to the partition to which they
are most connected. During this step, the algorithm ensures that the logic capacity constraint of the target
architecture is not violated. In case of violation, the algorithm reduces the number of
instances by moving further down the hierarchy and tries to assign instances based
on the connectivity. This process continues until all the instances are assigned to
different partitions. The algorithm combines the top-down partitioning approach with the
hierarchical interconnect information of the design under consideration to minimize
the cut-net count and, as a result, it produces a solution in very little time. This kind of
approach is particularly useful for designs that are inherently hierarchical in nature.
The pseudo code of this algorithm is given in Algorithm 1 and the steps performed are
summarized as follows:
1. Take the parsed instance list, the number of partitions and the partition capacity as
input.
2. Take N unassigned instances and assign them to M partitions based on their
connectivity. While assigning, make sure that the partition capacity is not violated and
the cut-net count is minimal.
3. Mark the N instances as assigned and go back to step 2.
4. Terminate when all the instances are assigned.
Get_hierarchy;
Get_partitions;
Get_capacity;
while unassigned_instances do
    instances = N;
    find(max_connection);
    if capacity > N then
        assign_instances(N, M);
        assigned = N;
    else if instance_breakable then
        level = level - 1;
    else
        Partitioning_impossible;
    end
end
Algorithm 1: Pseudo-code for the Hierarchical Algorithm
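The top-down flow of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Block` class, the function name `hierarchical_partition`, the net representation (tuples of leaf names) and the tie-breaking rule (prefer the emptiest partition) are all assumptions made for illustration. A hierarchy block is placed as a whole in the partition it shares the most nets with; if it fits nowhere, it is broken into its children, which corresponds to moving one level down the hierarchy.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A node of the design hierarchy (hypothetical structure)."""
    name: str
    size: int = 1                      # logic resources used by a leaf
    children: list = field(default_factory=list)

    def total_size(self):
        if not self.children:
            return self.size
        return sum(c.total_size() for c in self.children)

    def leaves(self):
        if not self.children:
            return [self.name]
        return [n for c in self.children for n in c.leaves()]


def hierarchical_partition(top, nets, num_parts, capacity):
    """Assign every leaf of `top` to one of `num_parts` partitions."""
    assignment = {}                    # leaf name -> partition index
    used = [0] * num_parts             # resources consumed per partition
    pending = [top]                    # blocks that are still unassigned
    while pending:
        block = pending.pop(0)
        members = set(block.leaves())
        # Connectivity of this block to each partition: number of nets
        # it shares with leaves that are already placed there.
        conn = [0] * num_parts
        for net in nets:
            if members & set(net):
                for p in {assignment[n] for n in net if n in assignment}:
                    conn[p] += 1
        need = block.total_size()
        fits = [p for p in range(num_parts) if used[p] + need <= capacity]
        if fits:
            # Most connected partition first; emptiest one breaks ties.
            best = max(fits, key=lambda p: (conn[p], -used[p]))
            for leaf in members:
                assignment[leaf] = best
            used[best] += need
        elif block.children:
            # Capacity would be violated: move one level down the hierarchy.
            pending = block.children + pending
        else:
            raise ValueError(f"cannot place {block.name}: partitioning impossible")
    return assignment
```

On a toy design whose top level contains two sub-blocks of two leaves each, with two partitions of capacity 2, the top block does not fit anywhere, so the sketch descends one level and keeps each sub-block together in its own partition.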
The aforementioned steps are performed iteratively: connectivity among the instances
is given top priority and the partition size is always respected. As described in
Sect. 5, this approach is better suited to designs that have an inherently
hierarchical interconnect architecture. For flat designs, a more sophisticated
approach is required, which is described next.
Contrary to the hierarchical approach, which exploits the hierarchy of the design, the
multilevel approach uses clustering and refinement over multiple levels. In this
approach, the instances of the benchmark are first represented in the form of a
hypergraph. Initially, the graph contains a large number of instances and is difficult
to partition directly. Therefore, the graph is reduced by merging smaller instances
together. This process is called clustering, and it is repeated over multiple levels
until the number of clusters is reduced to a few dozen, at which point the graph is
small enough for refinement to be easy. An example of this multilevel clustering
process is given in Fig. 7, where a large hypergraph is reduced to a smaller hypergraph
after multiple iterations of clustering. Once the clustering process is complete, the
graph is refined and expanded in reverse order. During the refinement process, the instances
are moved between different clusters. The objective of the refinement process is to
minimize the overall cut-net count of the design. Each time a block (i.e. instance) is
moved from one cluster to another, the change in the total cut-net count is computed. If
the change is negative (which means the total number of cut-nets is reduced), the move
is accepted; otherwise it is rejected. This is a greedy approach, which may get trapped
in a local minimum. To avoid this situation, moves that increase the cut-net count
(i.e. moves with a positive change) are also accepted, depending on the level of
refinement. At higher levels such moves are accepted, but they are not accepted when
the refinement is performed at lower levels. The refinement process continues until the
bottom level of the graph is reached. Upon reaching this point, the partitioning process
is complete and we have the final partitioned result. An overview of the refinement
process is shown in Fig. 8, where only 2-way refinement is shown. However, the proposed
multilevel partitioning tool is generic and is able to perform N-way partitioning. The
multilevel partitioning tool uses the same approach as presented in [43], where
clustering is performed first, followed by the initial partitioning and refinement
phases. However, the work presented in [43] performs partitioning of homogeneous
instances only. In contrast, the proposed tool can handle heterogeneous instances and
also takes into account the maximum partition size.
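The move-acceptance rule described above can be sketched in a few lines of Python. This is a simplified, purely greedy pass, not the refinement used in the proposed tool: it never accepts a move that increases the cut-net count, it recomputes the full cut-net count for every trial move instead of maintaining incremental gains, and the names `cut_nets` and `greedy_refine`, as well as the data layout (nets as tuples of block names, `part` as a block-to-partition dictionary), are assumptions made for illustration.

```python
def cut_nets(nets, part):
    """Number of nets whose blocks span more than one partition."""
    return sum(1 for net in nets if len({part[b] for b in net}) > 1)


def greedy_refine(nets, part, num_parts, capacity, sizes=None):
    """One greedy pass: accept a move only if it lowers the cut-net count."""
    part = dict(part)                       # work on a copy
    sizes = sizes or {b: 1 for b in part}   # unit-size blocks by default
    used = [0] * num_parts
    for b, p in part.items():
        used[p] += sizes[b]
    for b in sorted(part):
        src = part[b]
        base = cut_nets(nets, part)
        for dst in range(num_parts):
            # Skip moves that would exceed the maximum partition size.
            if dst == src or used[dst] + sizes[b] > capacity:
                continue
            part[b] = dst                   # trial move
            delta = cut_nets(nets, part) - base
            if delta < 0:                   # fewer cut-nets: accept
                used[src] -= sizes[b]
                used[dst] += sizes[b]
                break
            part[b] = src                   # otherwise undo the move
    return part
```

For a chain of four blocks `a-b-c-d` that starts with `b` alone in partition 1, one pass moves `a` next to `b` and lowers the cut-net count from 2 to 1. Accepting some cut-increasing moves at the coarser levels, as the text describes, would require an additional level-dependent acceptance test that this sketch omits.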
Multilevel partitioning is a highly sophisticated technique and, for flat designs,
it offers better results than the hierarchical approach. However, it requires
significantly more time to produce the partitioning result. Furthermore, the hierarchical
approach gives equal or better results for designs that are purely hierarchical in
nature. The pseudo code of the multilevel algorithm used in this work is shown in
Algorithm 2.
level = 0;
hierarchy[level] = hypergraph;
min_vertices = 200;
while hierarchy[level].vertex_count() > min_vertices do
    next_level = cluster(hierarchy[level]);
    level = level + 1;
    hierarchy[level] = next_level;
end
partitioning[level] = a random initial solution for top-level hypergraph;
FM(hierarchy[level], partitioning[level]);
while level > 0 do
    level = level - 1;
    partitioning[level] = project(partitioning[level+1], hierarchy[level]);
    FM(hierarchy[level], partitioning[level]);
end
Algorithm 2: Pseudo-code for the Multilevel Partitioning Algorithm
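The `cluster` and `project` steps of Algorithm 2 can be made concrete with a small Python sketch. The matching rule used here (pairing still-unmatched vertices that share a net) is only one simple clustering choice and is an assumption, as are the function names `coarsen` and `uncoarsen` and the `refine` hook, which stands in for the FM call and by default does nothing.

```python
def cluster(nets, vertices):
    """Merge unmatched vertices that share a net, two at a time."""
    parent = {}                              # fine vertex -> cluster id
    for net in nets:
        free = [v for v in net if v not in parent]
        for u, v in zip(free[0::2], free[1::2]):
            parent[u] = parent[v] = u
    for v in vertices:
        parent.setdefault(v, v)              # unmatched vertices stay alone
    coarse_vertices = sorted(set(parent.values()))
    coarse_nets = []
    for net in nets:
        merged = tuple(sorted({parent[v] for v in net}))
        # Drop nets that collapsed into one cluster, and duplicates.
        if len(merged) > 1 and merged not in coarse_nets:
            coarse_nets.append(merged)
    return coarse_vertices, coarse_nets, parent


def project(coarse_part, parent):
    """Give every fine vertex the partition of its cluster."""
    return {v: coarse_part[c] for v, c in parent.items()}


def coarsen(vertices, nets, min_vertices):
    """Cluster repeatedly until the hypergraph is small enough."""
    levels = []                              # one parent map per level
    while len(vertices) > min_vertices:
        coarse_v, coarse_n, parent = cluster(nets, vertices)
        if len(coarse_v) == len(vertices):
            break                            # no further reduction possible
        levels.append(parent)
        vertices, nets = coarse_v, coarse_n
    return vertices, nets, levels


def uncoarsen(coarse_part, levels, refine=lambda part: part):
    """Project the coarsest solution back level by level."""
    part = coarse_part
    for parent in reversed(levels):
        part = refine(project(part, parent))
    return part
```

A caller would partition the coarsest hypergraph returned by `coarsen`, then pass that solution and `levels` to `uncoarsen`, supplying a real refinement routine in place of the default no-op.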
In this section, we present the experimental results that are obtained through the
exploration flow described in Sect. 3. Initially, an overview of the benchmarks used in
this work is presented and next the results obtained for those benchmarks are discussed.
5.1 Benchmarks
Fig. 9 Bi-terminal Cut-Net comparison between hierarchical and multilevel partitioning approach
Fig. 10 Multi-terminal Cut-Net comparison between hierarchical and multilevel partitioning approach
Fig. 12 Multiplexing ratio comparison between hierarchical and multilevel partitioning approach
Fig. 13 System frequency comparison between hierarchical and multilevel partitioning approach
$$\text{sys\_freq} = \frac{125}{\text{mux\_ratio}}\ \text{MHz} \qquad (1)$$
The system frequency results obtained using the two partitioning approaches are shown
in Fig. 13. It can be seen from this figure that, for mono-cluster benchmarks, the
multilevel partitioning approach gives better system frequency results whereas for
multi-cluster benchmarks, the hierarchical partitioning approach gives better results.
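As a small worked illustration of Eq. (1), the following Python helper computes the system frequency from a multiplexing ratio. The function name and the input check are assumptions; the constant 125 is taken directly from Eq. (1), and the ratios in the usage lines are made-up values, not results from the paper.

```python
def system_frequency_mhz(mux_ratio, base_mhz=125.0):
    """System frequency in MHz according to Eq. (1): 125 / mux_ratio."""
    if mux_ratio <= 0:
        raise ValueError("mux_ratio must be positive")
    return base_mhz / mux_ratio


# Illustrative ratios only: a lower multiplexing ratio gives a higher frequency.
print(system_frequency_mhz(5))    # 25.0
print(system_frequency_mhz(10))   # 12.5
```

Because the relation is a simple inverse, halving the multiplexing ratio doubles the achievable system frequency, which is why a partitioning that lowers the multiplexing ratio translates directly into a faster prototype.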
Fig. 14 Execution time comparison between hierarchical and multilevel partitioning approach
6 Conclusion
For multi-FPGA systems, partitioning plays a very important role in determining the
quality of the final prototyped design. This work explores two partitioning approaches
for multi-FPGA prototyping systems. The first approach exploits the inherent hierarchy
of the benchmarks, while the second uses multilevel clustering and refinement
to partition the design under consideration. For exploration purposes, we
use a set of fourteen large, complex and realistic benchmarks. Experimental results
obtained through the exploration environment of this work demonstrate that the
multilevel partitioning approach gives better overall results for mono-cluster
benchmarks, whereas the hierarchical partitioning approach gives better results for
multi-cluster benchmarks. On average, the multilevel approach gives 12.5% better
frequency results for mono-cluster benchmarks, whereas the hierarchical approach gives
13% better frequency results for multi-cluster benchmarks. The execution time
comparison between the two approaches further reveals that the hierarchical approach is
faster irrespective of the nature of the benchmarks under consideration: on average, it
requires 60% less execution time than the multilevel partitioning approach.
In this work, our emphasis has mainly been on the exploration of partitioning
approaches. In the future, we will make the proposed multi-FPGA prototyping flow
more comprehensive by introducing novel in-circuit verification techniques. These
techniques can be used for the functional verification of a design after its
prototyping is finished.
References
1. Santarini M (2005) Asic prototyping: make versus buy. EDN 11
2. Sigenics: Custom Asic calculator (2017). http://www.sigenics.com/page/custom-asic-cost-calculator
3. AMD (2007). http://techreport.com/news/13721/chip-problem-limits-supply-of-quad-core-opterons
4. Pentium (1994). https://en.wikipedia.org/wiki/pentium_fdiv_bug
5. Graphics M (2017). https://www.mentor.com/products/fv/modelsim/
6. Kuon I, Rose J (2010) Quantifying and exploring the gap between FPGAs and ASICs. Springer, Berlin
7. Krupnova H (2004) Mapping multi-million gate SOCS on FPGAS: industrial methodology and
experience. In: Proceedings of design, automation and test in Europe conference and
exhibition, vol 2, pp 1236–1241
8. Asaad S, Bellofatto R, Brezzo B, Haymes C, Kapur M, Parker B, Roewer T, Saha P, Takken T, Tierno J
(2012) A cycle-accurate, cycle-reproducible multi-FPGA system for accelerating multi-core processor
simulation. In: Proceedings of the ACM/SIGDA international symposium on field programmable gate
arrays, ser. FPGA ’12. New York, NY, USA: ACM, pp 153–162. https://doi.org/10.1145/2145694.
2145720
9. Garey MR, Johnson DS (1990) Computers and intractability: a guide to the theory of NP-completeness.
W. H. Freeman & Co., New York
10. VERIFIC (2019). https://www.verific.com/
11. Sigl G, Doll K, Johannes F (1991) Analytical placement: a linear or a quadratic objective function?
In: Design automation conference, pp 427–432
12. Alpert CJ, Chan T, Huang D, Kahng A, Markov I, Mulet P, Yan K (1997) Faster minimization of linear
wirelength for global placement. In: ACM symposium on physical design, pp 4–11
13. Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220:671–
680
14. Sechen C, Sangiovanni-Vincentelli A (1985) The timberwolf placement and routing package. JSSC,
pp 510–522
15. Dunlop A, Kernighan B (1985) A procedure for placement of standard-cell VLSI circuits. In: IEEE
transactions on CAD, pp 92–98
16. Huang D, Kahng A (1997) Partitioning-based standard-cell global placement with an exact objective.
In: ACM symposium on physical design, pp 18–25
17. Kernighan B, Lin S (1970) An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J
49:291–307
18. Fiduccia CM, Mattheyses RM (1982) A linear-time heuristic for improving network partitions. In:
Design automation conference, pp 175–181
19. Bui T, Chaudhuri S, Leighton T, Sipser M (1987) Graph bisection algorithms with good average
behavior. Combinatorica 7(2):171–191
20. Alpert CJ, Hagen LW, Kahng AB (1997) Multilevel circuit partitioning. In: Design automation con-
ference, pp 530–533
21. Karypis G, Aggarwal R, Kumar V, Shekhar S (1997) Multilevel hypergraph partitioning: application
in VLSI design. In: Design automation conference, pp 526–529
22. Karypis G, Kumar V (1999) Multilevel k-way hypergraph partitioning. In: Design automation confer-
ence
23. Haps protocompiler by synopsys (2017). http://www.synopsys.com/Prototyping/FPGABasedProto
typing/Pages/protocompiler.aspx
24. Haps multi-fpga board by synopsys (2017). http://www.synopsys.com/Prototyping/FPGABased
Prototyping/Pages/HAPS.aspx
25. Auspy (2017). https://www.mentor.com/products/fv/aupsy
26. Certify partitioning tool by synopsys (2017). http://www.synopsys.com/Prototyping/FPGABased
Prototyping/Pages/Certify.aspx
27. Series CP (2017). http://www.cadence.com/products/sd/palladium_xp_series/pages/default.aspx
28. Veloce MG (2017). https://www.mentor.com/products/fv/emulation-systems/
29. Zebu-server asic emulator by synopsys (2017). http://www.synopsys.com/tools/verification/hardware-
verification/emulation/Pages/default.aspx
30. Marrakchi Z, Mrabet H, Mehrez H (2005) Hierarchical FPGA clustering to improve routability. In:
Conference on Ph.D research in microelectronics and electronics, PRIME
31. Marrakchi Z, Mrabet H, Mehrez H (2006) A new multilevel hierarchical MFPGA and its suitable
configuration tools. In: Proceedings of ISVLSI, Karlsruhe, Germany
32. Turki M, Mehrez H, Marrakchi Z, Abid M (2013) Partitioning constraints and signal routing approach
for multi-fpga prototyping platform. In: 2013 International symposium on system on chip (SoC), pp
1–4
33. Tang Q, Mehrez H, Tuna M (2013) Routing algorithm for multi-fpga based systems using multi-point
physical tracks. In: 2013 International symposium on rapid system prototyping (RSP), pp 2–8
34. Farooq U, Baig I, Alzahrani BA (2018) An efficient inter-fpga routing exploration environment for
multi-fpga systems. IEEE Access 6:56301–56310
35. Inagi M, Takashima Y, Nakamura Y (2009) Globally optimal time-multiplexing in inter-fpga connec-
tions for accelerating multi-fpga systems. In: International conference on field programmable logic
and applications, pp 212–217
36. Hauck S, DeHon A (2007) Reconfigurable computing: the theory and practice of FPGA-based com-
putation. Morgan Kaufmann Publishers Inc., San Francisco
37. Stroobandt D, Verplaetse P, Van Campenhout J (2000) Generating synthetic benchmark circuits for
evaluating cad tools. IEEE Trans Comput Aided Des Integr Circuits Syst 19(9):1011–1022
38. Farooq U, Parvez H, Mehrez H, Marrakchi Z (2012) A new heterogeneous tree-based application
specific fpga and its comparison with mesh-based application specific fpga. Microprocess Microsyst
36(8):588–605. https://doi.org/10.1016/j.micpro.2012.06.012
39. Yang S (1991) Logic synthesis and optimization benchmarks user guide, version 3.0
40. Pouillon N, Greiner A (2010) Soc lib project. https://www.asim.lip6.fr/trac/dsx/
41. Miro Panades I, Greiner A, Sheibanyrad A (2006) A low cost network-on-chip with guaranteed service
well suited to the gals approach. In: 1st International conference on nano-networks and workshops,
NanoNet ’06, pp 1–5
42. McMurchie L, Ebeling C (1995) Pathfinder: a negotiation-based performance-driven router for fpgas.
In: ACM international symposium on field-programmable gate arrays, ACM Press, New York, pp
111–117
43. Karypis G, Kumar V (1999) Multilevel k-way hypergraph partitioning. In: Proceedings of the 36th
annual ACM/IEEE design automation conference, ser. DAC ’99, ACM, New York, NY, pp 343–348.
https://doi.org/10.1145/309847.309954
44. Synopsys (2017). http://www.synopsys.com/prototyping/fpgabasedprototyping/. http://www.
synopsys.com/Prototyping/FPGABasedPrototyping/FPMM/Pages/default.aspx
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.