
Received 24 May 2023, accepted 15 June 2023, date of publication 21 June 2023, date of current version 10 July 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3288431

A Survey on FPGA-Based Heterogeneous Clusters Architectures

WERNER FLORIAN SAMAYOA 1,2, MARIA LIZ CRESPO 1, ANDRES CICUTTIN 1, AND SERGIO CARRATO 2


1 Multidisciplinary Laboratory (MLab), The Abdus Salam International Centre for Theoretical Physics, 34151 Trieste, Italy
2 Dipartimento di Ingegneria e Architettura (DIA), Università degli Studi di Trieste, 34127 Trieste, Italy
Corresponding author: Werner Florian Samayoa ([email protected])
This work was supported by the University of Trieste and The Abdus Salam International Centre for Theoretical Physics.

ABSTRACT In recent years, the most powerful supercomputers have already reached megawatt power consumption levels, an important issue that challenges sustainability and shows the impossibility of maintaining this trend. To this date, the prevalent approach to supercomputing is dominated by CPUs and GPUs. Given their fixed architectures with generic instruction sets, they have been favored with lots of tools and mature workflows, which led to mass adoption and further growth. However, reconfigurable hardware such as FPGAs has repeatedly proven that it offers substantial advantages over this supercomputing approach concerning performance and power consumption. In this survey, we review the most relevant works that advanced the field of heterogeneous supercomputing using FPGAs, focusing on their architectural characteristics. Each work was divided into three main parts: network, hardware, and software tools. All implementations face challenges that involve all three parts. These dependencies result in compromises that designers must take into account. The advantages and limitations of each approach are discussed and compared in detail. The classification and study of the architectures illustrate the trade-offs of the solutions and help identify open problems and research lines.

INDEX TERMS FPGA, SoC, heterogeneous computing, supercomputing, reconfigurable computing.

The notion of reconfigurable hardware has been present since 1984, when Altera delivered the first programmable logic device (PLD) to the industry [1]. Then, in 1985, Ross Freeman and Bernard Vonderschmitt patented the first commercially viable field-programmable gate array (FPGA) [2]. Owing to production costs, when compared to application-specific integrated circuits (ASICs), FPGAs are traditionally used in applications with low production volumes that require high throughput and low latency.

FPGAs are electronic devices that consist of many configurable logic blocks composed of look-up tables, flip-flops, I/O blocks, and interconnection fabric. FPGAs are used to create custom hardware solutions, which makes the implementation of algorithms quite different from targeting a CPU. The initial step typically consists of describing the algorithm using a Hardware Description Language (HDL), such as VHDL or Verilog. The HDL description is then synthesized into a netlist that is mapped onto the FPGA's logic elements and interconnections required to implement the desired digital design. The final implementation in the FPGA is performed using vendor-specific tools such as Vivado [3], Vitis [4], Quartus [5], and Libero [6]. Once the mapping and routing process is completed, the design is compiled into a bitstream file loaded onto the FPGA to configure its logic elements and interconnections to create a circuit corresponding to the algorithm. It has to be added, however, that although proprietary FPGA vendor tools have dominated the field, there are now some open-source FPGA tools, such as Yosys [7], F4PGA [8], and RapidSilicon [9], that provide alternative options for developers seeking open-source solutions.


FPGAs have evolved into more complex devices [10] by integrating components such as embedded memory resources, clock management units, digital signal processing blocks (DSP), network-on-chip (NoC), and CPUs [11]. These hybrid devices are known as system-on-chip (SoC-FPGA) or adaptive SoCs, depending on the vendor. Their increased capabilities have raised interest in both specific applications and general-purpose use [12], [13].

As a reconfigurable device, the FPGA offers the advantage of continuous improvement in hardware and software. In fact, being able to change the architecture offers great freedom when developing complex systems. Furthermore, FPGAs have been shown to consume considerably less power than CPUs and GPUs [14], leading to reduced cooling and energy costs.

By studying computing problems, a classification based on repeating algorithmic patterns was proposed in 2004 [15]. In 2006, six new algorithmic encapsulations were defined in [16], expanding the classification to 13 dwarfs, as shown in Table 1. Theoretically, each dwarf can be mapped onto a specific computing architecture [17], [18]. This has inspired the creation of benchmarks for heterogeneous systems such as DwarfBench [19], Rodinia [20], and OpenDwarfs [21].

TABLE 1. The 13 dwarfs of Berkeley [16], where each one represents an algorithmic method encapsulating patterns of communication and/or computation with example problems.

Several implementations of heterogeneous high-performance computing (HPC) systems housing FPGAs can be named, such as Project Catapult at Microsoft [22], Alibaba FaaS (FPGA as a Service) [23], Amazon EC2 F1 instances [24], and the ARUZ cluster at Lodz University [25]. At CERN, the massive adoption of FPGAs for online data processing has motivated the development and adoption of specific tools to aid the development of applications based on FPGAs, such as hls4ml [26] (high-level synthesis for machine learning). This tool, along with many others [27], [28], [29], [30], [31], allows for a higher level of abstraction, thereby significantly reducing implementation errors and development time. The preference for FPGAs is due to their reconfigurability, which allows extreme hardware specialization when needed. In addition, the fact that FPGAs offer a wide array of input-output ports makes them ideal for stream computation and for creating pipelined systems that can maintain high throughput with low latency.
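As an illustration of the abstraction level such tools provide, the sketch below follows the publicly documented hls4ml Keras flow. It is a minimal, hedged example: the toy model, the output directory, and the FPGA part number are placeholders, and the exact API can vary between hls4ml releases.

```python
# Minimal sketch of the hls4ml flow: a small Keras model is translated into an
# HLS project that vendor tools can later synthesize into FPGA firmware.
# The model, part number, and directory are placeholders, not recommendations.
import hls4ml
from tensorflow import keras

# Toy fully connected network standing in for a real trigger/classification model.
model = keras.Sequential([
    keras.layers.Input(shape=(16,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(5, activation="softmax"),
])

# Derive a per-model HLS configuration (precision, reuse factors, ...).
config = hls4ml.utils.config_from_keras_model(model, granularity="model")

# Translate the network into an HLS project targeting a hypothetical device.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="hls4ml_prj",        # placeholder project directory
    part="xcvu9p-flga2104-2-e",     # placeholder FPGA part
)

hls_model.compile()                 # builds a C/C++ emulation library for validation
# hls_model.build(csim=False)       # would invoke the vendor HLS synthesis step
```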
The purpose of this survey is to demonstrate and analyze the challenges of heterogeneous supercomputing by studying the most relevant implementations of FPGA-based cluster architectures from different application fields. Each studied platform provides valuable insight into the decisions and tradeoffs developers have made to reach their specific goals. By leveraging their experience, it will be possible to visualize the evolution and present trends in FPGA-based clusters and target the main open challenges. We propose dividing the architectural components of each cluster into network, hardware, and software tools. This division helps identify and discuss the pros and cons of each component in its corresponding domain.

The main contributions of this study are as follows:
1) The comprehensive study of the state-of-the-art of FPGA-based clusters.
2) A three-way segmentation of the clusters' architecture.
3) A critical discussion of the components that build up the studied clusters.

In the context of this paper, we describe a cluster by its computational units (CU), which correspond to its smallest independent part and sometimes coincide with a single network node. Each CU can be composed of several computational elements (CE), namely CPUs, GPUs, and FPGAs.

The remainder of this paper is organized as follows. Section I elaborates on the implementations and explores relevant advancements in their application fields. A table at the end of each application field discussion summarizes the main contributions of each study, along with the reported performance and energy improvements, when available. Significant differences can be understood by studying the evolution of heterogeneous clusters within each niche. Section II presents the classification of systems from an architectural perspective. The three main aspects described in each implementation were used as comparison points. Subsection II-A presents a comparison of the network infrastructure in the studies. The hardware available in each studied cluster is discussed in Subsection II-B. To complete the classification discussion, the developer tools are compared in Subsection II-C. In Section III we present the remaining open problems and the identified trends. To close this paper, Section IV draws conclusions.

I. CLUSTER IMPLEMENTATIONS
Different FPGA-based cluster implementations were studied, and their specific characteristics highlight the purpose for which they were planned. Technological advances offer greater flexibility, and cost reduction opens the door to increasing complexity.


It can be appreciated that there is a growing interest in developing research-capable platforms to explore diverse areas of heterogeneous supercomputing. Tables 2, 3, 4, 5, and 6 provide a summary of the contributions of each work and the reported energy and performance, if available.

A. MANYCORE EMULATION
The development of manycore platforms is a long and expensive process that involves several stages of experimentation, validation, and integration. There are software tools that help simulate architectures for easy parameter tuning, with the major drawback of speed. In this particular aspect, FPGA prototyping allows faster execution times and benefits from insights from real hardware. It is not rare for a complete platform to exceed the logic available in a single FPGA, pushing for a cluster of FPGAs.

This has been the case since 1997, when one of the first FPGA clusters was used to emulate the RAW architecture [32]. The RAW cluster consisted of 5 boards or CUs, each with 64 FPGAs, totaling 320 FPGAs. Its results showed orders of magnitude speed-up compared to contemporaneous scalable processors, with the disadvantages of reduced flexibility, high cost, and high implementation complexity, which hindered its adoption in other research applications.

In 2006, the FAST [33] cluster was presented to bring hardware back into the research cycle and to address the disadvantages of RAW. FAST combined dedicated microprocessor chips and static random access memories (SRAM) with FPGAs into a heterogeneous hybrid solution to simulate chip multiprocessor architectures. The vision was to reduce hardware costs and ease development, both for programming and portability. Each FAST CU consisted of 8 processors, 10 Xilinx Virtex FPGAs, and 4 memory-interconnected tiles. The 2 processors in each tile acted as the CPU and floating-point processing unit, respectively, and 2 FPGAs acted as the level-one memory controller and coprocessor.

A central hub, made up of 2 FPGAs, was used to manage shared resources and orchestrate communication between tiles, allowing access to off-the-board devices through external IOs. Additionally, the expansion connector available to the FPGA hub allows multiple FAST CUs to be connected. The CU implementation is illustrated in Figure 1.

FIGURE 1. FAST [33] computational unit (CU) with the computing tiles in orange and the FPGA hub in purple.

A custom software stack was developed specifically for FAST. It included several modules and predefined interfaces for functionality and benchmarking. An operating system was developed to manage control tasks such as programming and configuration. Portability was demonstrated by implementing several architectures; however, scalability and costs remained open to discussion.

Similar to FAST, the RAPTOR cluster was presented as a baseboard hosting up to 4 daughter cards based on complex programmable logic devices (CPLD) [34]. In 2010, a second version was presented using FPGAs and a renewed architecture [35]. This new version consisted of a RAPTOR-Xpress baseboard (CU) that provides two buses for Gigabit Ethernet, universal serial bus (USB) 2.0, and peripheral component interconnect express (PCIe) 2.0 × 8 for the host connection to configure and manage up to 4 DB-V5 (daughter board version 5).

Figure 2 shows the RAPTOR-Xpress baseboard with 4 DBs interfaced directly with their neighbors in a ring topology. Each has a Xilinx Virtex-5 FPGA with up to 4 GB of DDR3 memory and a dedicated FPGA as a PCIe interface. Multiple baseboards can be connected together via 4 high-speed connectors, each consisting of 21 full-duplex serial lanes, enabling scaling resources beyond the 4 DBs on board. The baseboards can also be interfaced with the host via dedicated FPGAs on Nallatech front-side bus acceleration modules [36], which provide an extra 8.5 Gb/s for writing and 5.6 Gb/s for reading.

The RAPTOR project also comprises a custom software development environment that includes the RAPTORLIB, RAPTORAPI, and RAPTORGUI tools, which aid developers by providing hardware-supported protocols, remote access, and a graphical user interface to facilitate testing. The design flow includes aids for design partitioning, which is a manual process assisted by a graphical integrated development environment (IDE) and standard synthesis tools developed in vMAGIC [37].

Convinced by the need for cheaper and smaller hardware, the Formic cluster [38], based on the Formic board [39], was presented in 2014. The Formic board acts as the building block for a larger system, with a maximum size of 4096 boards. Each board consists of an FPGA, SRAM, 1 GB of double data rate (DDR) RAM, a power supply, buffered joint test action group (JTAG) connectors, and configuration memory, making it independent and perfectly symmetric. Eight multi-gigabit transceivers (MGT) at a maximum speed of 3 Gb/s are available for interconnection on 8 serial advanced technology attachment (SATA) connectors. Inside each board, a full NoC with a 22-port crossbar switch interfaces the configured blocks with the MGT links and allows developers to scale the designs.


FIGURE 2. Simplified diagram of the RAPTOR-Xpress board [35] or computational unit (CU) with the
daughter boards in orange.

Access to local and remote memories is done using the Remote Direct Memory Access (RDMA) protocol [40]. As the first application, a multicore system based on 8 custom MicroBlaze [41] processors per module, forming a 512-core cluster [42], was implemented.
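The RDMA access model can be pictured as messages that name a target node and a target address, which the receiving network logic applies to its local memory without involving a processor. The sketch below is purely illustrative: the actual Formic packet format is not described here, so every field name and width is an assumption.

```python
# Illustrative model of RDMA-style remote writes between boards: the message
# itself names the destination memory, so the receiver needs no CPU involvement.
# Field widths and names are hypothetical, not the actual Formic format.
import struct

HEADER = struct.Struct(">BHI")   # assumed header: opcode, destination node, destination address
OP_WRITE = 0x01

def encode_remote_write(dest_node: int, dest_addr: int, payload: bytes) -> bytes:
    """Pack a remote write request destined for another board's DDR/SRAM."""
    return HEADER.pack(OP_WRITE, dest_node, dest_addr) + payload

def apply_remote_write(local_memory: bytearray, packet: bytes) -> None:
    """What the receiving NoC endpoint would do: copy the payload straight to memory."""
    opcode, _node, addr = HEADER.unpack_from(packet)
    assert opcode == OP_WRITE
    payload = packet[HEADER.size:]
    local_memory[addr:addr + len(payload)] = payload

# Example: node 3 writes 8 bytes into node 7's memory at offset 0x100.
memory_of_node_7 = bytearray(1 << 12)
pkt = encode_remote_write(dest_node=7, dest_addr=0x100, payload=b"\x11" * 8)
apply_remote_write(memory_of_node_7, pkt)
```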
Simultaneously, the industry has produced exciting developments in manycore emulation. In an attempt to reduce the time to market for new ICs, Cadence [43] and Siemens [44], together with others, developed solutions for the prototyping of ICs. Unfortunately, there is little accessible information regarding the architecture of most implementations, and the high costs make them uncommon in academia, with some exceptions, such as the Pico Computing board (now Micron) used for image processing [45] and the DINI (now Synopsys) FPGA board used for online video processing [46].

From the described works, it can be seen that there is a trend toward reducing the complexity of CUs, as shown in Figure 3. In this field, costs tend to be the leading factor, making granularity a desirable characteristic. With smaller CUs, it is possible to reduce the implementation costs, depending on the requirements of the chip to emulate. Smaller CUs also make it easier for clusters to scale, maintain, and upgrade.

B. SCIENTIFIC COMPUTING
The complexity of scientific computing problems has always pushed technology to its limit, making computer clusters a basic requirement. Regardless of whether complex algorithms process huge amounts of data or massive systems are simulated, reconfigurable computing provides the level of customization required by these problems. This did not go unnoticed: as early as 1991, programmable hardware was already part of custom supercomputers for specific problems, as in RTN [47], RASA in 1994 [48], and later SUE in 2001 [49].

The first massive cluster was created in 2006. Janus [50] was a massively parallel modular cluster for the simulation of specific theoretical problems in physics, developed by a large collaboration of European institutions [51]. The core of Janus comprised an array of 4 by 4 FPGA-based simulation processors (SP), which were connected with their nearest neighbors. Another processing unit, called the Input/Output processor (IOP), acted as a crossbar and was in charge of managing communications between the FPGAs and the host.

A two-layer software stack was created to help developers build applications. The firmware layer consisted of a fixed part targeting the IOPs, which included a stream router and dedicated devices to communicate with, manage, and program the SPs. The second layer, the Janus Operating System (JOS), consisted of the programs running on the host PCs, including a set of libraries (JOSlib) to manage the IOP devices, Unix socket application program interfaces (APIs) to integrate high-level applications and new SP modules, and an interactive shell (JOSH) for debugging and testing.

In the worst case, Janus performed just 2.4 times faster than conventional PCs. Nonetheless, Janus was limited by its performance and scarce memory for some applications [52].

In parallel, great interest has been shown in the cryptanalysis field with the development of the COPACOBANA FPGA cluster [53] in 2006. Figure 4 shows the COPACOBANA cluster, which was built over a CU holding up to 20 dual in-line memory modules, each with 6 Xilinx Spartan-3 FPGAs directly connected to a 64-bit data bus and a 16-bit control bus. A controller module allowed the host PC to interact via USB or Ethernet through a software library that provided the necessary functions for the PC to program, store, and read the status of the cluster as a whole or of individual FPGAs. This made it possible to scale resources by attaching another CU to the host PC. Its capabilities were demonstrated by testing several encryption algorithms, which resulted in it outperforming conventional computers by orders of magnitude [54].

The positive outcome of this project motivated the creation of a hybrid FPGA-GPU cluster [55] based on commercial off-the-shelf (COTS) components in 2010.


FIGURE 3. Clusters targeting manycore emulation have shown a trend of reducing the
complexity and increasing the granularity of CUs to favor production costs and
scalability.

TABLE 2. Manycore emulation clusters’ contributions and reported performance improvement.

The Cuteforce [56] system implemented 15 CUs, 14 with Xilinx Virtex FPGAs and the last with an NVIDIA GPU, interconnected through a CPU on a CU via InfiniBand. The results were not as expected, partly because of complications in the FPGA implementation.

The same approach was later used in 2010 by Tse et al. [57], who focused on Monte Carlo simulations. However, instead of using one CE per CU, a single CU was used to host 2 CPUs, an NVIDIA GPU, and a Xilinx FPGA, which was further supported by a comprehensive analysis of performance and energy. The network remained practically unchanged from Cuteforce, where the CPUs are the main communication CEs and relegate GPUs and FPGAs to an accelerator position. To further demonstrate the scalability of this strategy, Superdragon [58] was created to accelerate single-particle cryo-electron 3D microscopy.

Bluehive [59] also sought to distance itself from custom PCBs by embracing commodity boards to build a custom FPGA cluster for scientific simulations and manycore emulation [60] requiring high-bandwidth and low-latency communication. These challenges were overcome with the development of a 64-node FPGA cluster based on Terasic DE4 boards, which host an Altera Stratix IV FPGA, an 8x PCIe connector, and a DIMM with 4 GB of RAM, interfaced through a custom interconnect called BlueLink [61], with four 8U rack boxes, each holding 16 boards. The boards in the boxes were interconnected through a PCIe-to-eSATA board. A small Linux computer allowed remote programming using a USB-to-JTAG converter and a DE2 board as a JTAG fan-out to parallelize the configuration.

The Bluehive development environment was supported by Quartus, and mandatory blocks were provided to developers: routers for inter-FPGA communication, FBs, and high-speed serial link controllers [61], all developed in Bluespec SystemVerilog [62].

In 2014, Janus received an important upgrade [63], which significantly improved its performance. The architecture remained mostly the same, with the largest change being the adoption of newer FPGAs with 8 GB of RAM and MGTs instead of ordinary I/Os for interconnection.

Janus II and Bluehive were successful in tackling the memory issue, but as problems scale, larger clusters were needed. This was the case for ARUZ [25], an application-specific cluster formed by approximately 26,000 FPGAs distributed over 20 panels, each consisting of 12 rows, which in turn contained 12 CUs. The CUs are composed of eight slave FPGAs that constitute the resources and a central master SoC-FPGA that manages operations.


The addition of the Zynq SoC is motivated by the higher abstraction level provided by the ARM processor for slow-control tasks. In addition, each CU is interfaced with a concentrator board (CB) that feeds the state of the simulations to a host that controls the entire process.

Global communication is based on Gigabit Ethernet and allows data exchange between the SoC-FPGAs to configure their 8 FPGAs. All nodes are connected in a daisy chain, and only one board is connected to an external switch. A custom protocol for data transfer was developed, consisting of a small packet of no more than 256 bytes, with a constant overhead of 11 bytes.
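To make the framing concrete, the sketch below packs a message that respects the two constraints reported for the ARUZ protocol: a constant 11-byte overhead and a total packet size of at most 256 bytes. The individual header fields are invented for illustration only; the actual layout is not given here.

```python
# Toy packer for an ARUZ-like transfer: 11 bytes of fixed overhead and a total
# packet never exceeding 256 bytes, leaving at most 245 bytes of payload.
# The header fields (destination, source, type, length, sequence, CRC) are
# assumptions made only for this example.
import struct
import zlib

HEADER_FMT = struct.Struct(">HHBBBI")          # 2+2+1+1+1+4 = 11 bytes of overhead
MAX_PACKET = 256
MAX_PAYLOAD = MAX_PACKET - HEADER_FMT.size     # 245 bytes

def build_packet(dst: int, src: int, msg_type: int, seq: int, payload: bytes) -> bytes:
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload exceeds the 256-byte packet budget")
    crc = zlib.crc32(payload)
    header = HEADER_FMT.pack(dst, src, msg_type, len(payload), seq & 0xFF, crc)
    return header + payload

packet = build_packet(dst=0x0A12, src=0x0A11, msg_type=1, seq=7, payload=b"state" * 40)
assert len(packet) <= MAX_PACKET
```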
The ARUZ designers developed their own methodology [64], as there are no standard solutions available. Considering the multitude of mechanism combinations for programming and controlling ARUZ, a high level of flexibility is required. VPP [65] was selected for code pre-processing and parameterization. DLLDesigner was developed to generate VHDL code for interconnecting as many FBs as required. All of these tools allow the implementation of highly optimized architectures for molecular simulations.

FPGAs have also found a place in neuromorphic computing, as demonstrated by Bluehive. Spiking neural networks (SNN) require many densely interconnected elements. Their substantial level of parallelism is suitable for hardware acceleration; however, the challenge is scalability. This was specifically addressed by Astrobyte [66] using a fully scalable NoC-based FPGA cluster with functional verification and real-time monitoring.

However, more specialized platforms presented better results at higher costs. This is the case for BiCoSS [67], a cluster of 35 system-on-modules, each with a Cyclone IV FPGA and 2 SDRAMs, capable of simulating 4 million spiking neurons in real time.

Another relevant application in the scientific context is real-time control (RTC) systems for adaptive optics (AO) instruments. This is the main focus of the Green Flash project [68], which aims to develop energy-efficient real-time HPC accelerators and smart interconnects based on GPUs and FPGAs [69]. The RTC modules have a standard CPU server that hosts an NVIDIA GPU, an Intel CPU, and an Intel Arria 10 FPGA. The FPGAs are hosted on a custom mainboard called µXComp, which includes 2 GB of onboard RAM, PCIe 3, an FMC connector, Ethernet, 4 QSFP ports, and other valuable resources.

In this heterogeneous system, communication between GPUs is performed by a Smart Interconnect (SI) system implemented on FPGAs. The SI uses the UDP protocol, which is implemented in the FPGA fabric alongside the device protocol handlers and dedicated direct memory access (DMA) engines. This is configured with the QuickPlay FPGA framework, which extends its capabilities by using abstraction models and board support packages (BSPs) for portability. This architecture allows pipelining several GPUs and FPGAs. A similar approach can be seen in the SpiNNaker [70] and BrainScaleS [71] supercomputers, which implement dedicated ASICs interconnected by FPGAs for neuromorphic computing, and in MDGRAPE [72] for molecular dynamics simulations.

FIGURE 4. COPACOBANA [53] computational unit (CU) with the dual in-line memory module (DIMM) modules in orange, each with 6 FPGAs, and the controller module in purple.

C. FPGAS IN DATA CENTERS
The positive results obtained by FPGAs attracted great interest outside of the scientific community, specifically in the data center (DC) context, where computing tasks can quickly overwhelm CPUs. DC workloads demand reduced power consumption, latency, and cost while maximizing computing power and flexibility.

Catapult [22] is a successful example of the inclusion of FPGAs in a high-reliability commodity DC. FPGAs were specifically selected given that the flexibility of reconfigurable hardware helps tackle the 2 main requests in DCs. First, the desire for homogeneity, which greatly facilitates the installation, maintenance, and deployment of services. Second, the need for flexibility, considering that such services evolve rapidly, making fixed hardware impractical.

A custom half-width unit motherboard was developed to host 2 high-end CPUs and the daughter FPGA card, which consisted of a Stratix V D5 FPGA with 8 GB of DRAM and acted as the CU. Two 12-core Sandy Bridge processors with 64 GB of RAM, 2 SSDs, and 4 HDDs complete the resources present on the motherboard. The FPGA and host CPUs communicate via PCIe, and high-speed transceivers are used in the inter-FPGA network. A two-dimensional 6 × 8 node torus was selected for the network configuration in each rack. For the final system, 34 of these racks were used for a total of 1632 nodes.

TABLE 3. Scientific computing clusters’ contributions, reported power and performance gains.

To evaluate the performance of Catapult, a significant portion of the ranking stack of Bing was offloaded to each rack. To guarantee the reliability of the system, the following services were implemented: 2-bit error detection and 1-bit error correction on top of the CRC in the DRAM and high-speed network. For user productivity and reusability, the FPGA space was split into 2 parts: a shell, which hosts hardware controllers, an inter-FPGA network stack, a status notifier, and single-event upset logic to reduce system errors, all of which consume 23% of the FPGA resources; and a role part, where the computing logic lies. Additionally, a mapping manager and health monitor continually scanned each node in the network. In case of failure, the faulty node is immediately reconfigured. If the issue persists, the node is flagged for manual intervention, and the mapping manager automatically relocates the services to the available resources.
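The recovery policy described above (reconfigure once, then flag the node for manual service and move its role elsewhere) can be summarized in a few lines of control logic. The sketch below is a simplified illustration of that policy, not the actual Catapult implementation; all class and field names are ours.

```python
# Simplified illustration of the reported recovery policy: reconfigure a faulty
# node once; if the fault persists, flag it for manual service and relocate its
# role to a healthy spare. Not the actual Microsoft implementation.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    role: str
    healthy: bool = True
    reconfig_attempted: bool = False
    flagged: bool = False

@dataclass
class MappingManager:
    nodes: list = field(default_factory=list)

    def spare(self) -> Node:
        return next(n for n in self.nodes if n.healthy and n.role == "idle")

    def handle_fault(self, node: Node) -> None:
        if not node.reconfig_attempted:
            node.reconfig_attempted = True      # first response: reload the bitstream
            node.healthy = True                 # optimistic until the monitor disagrees
        else:
            node.flagged = True                 # persistent fault: manual intervention
            replacement = self.spare()
            replacement.role, node.role = node.role, "offline"

mgr = MappingManager([Node("fpga-00", "ranking"), Node("fpga-01", "idle")])
faulty = mgr.nodes[0]
faulty.healthy = False
mgr.handle_fault(faulty)    # first failure: reconfigure in place
faulty.healthy = False
mgr.handle_fault(faulty)    # second failure: relocate the ranking role
print([(n.name, n.role, n.flagged) for n in mgr.nodes])
```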
With custom hardware and a custom communication protocol, Catapult achieved a 95% improvement in throughput in a production search infrastructure when compared to a software-only solution. In addition, the inclusion of the FPGA increased power consumption by only 10%, and the added cost of ownership did not exceed the limit of 30%. These results show the significant advantage that FPGAs can offer in terms of throughput and power consumption.

With the success of Catapult [22], it was only a matter of time before FPGAs were made available for cloud computing tasks, which is exactly what the IBM cloudFPGA [73] did. Virtualizing the user space makes FPGAs in an Infrastructure-as-a-Service (IaaS) environment feasible for education, research, and testing.


In the architecture presented, the FPGAs are standalone nodes in the cluster, directly interfaced to the DC via PCIe, unlike the approach of Amazon [24], Alibaba [23], and IBM Supervessel [74], which tie the FPGAs to host CPUs. Under this approach, a daughter card consisting of an FPGA and abundant RAM was developed. By creating a custom carrier board, 64 daughter cards can be accommodated in a single 2U rack chassis [75]. To achieve the desired homogeneity within the DC, the FPGAs have been provided with a soft network interface chip, with the advantage of loading only the required services.

The multi-FPGA fabric formed by multiple prototypes of a network-attached FPGA was evaluated with a text-analytics application. The results, compared to a software implementation and an implementation accelerated with PCIe-attached FPGAs, show that the network-attached FPGAs improved in latency and throughput. Additionally, network performance was compared with bare-metal servers, virtual machines, and containers [76], with results orders of magnitude better for the FPGA prototype. To further improve the usability of the platform, continuous developments have been made to integrate MPI into the system [77], [78]. An in-depth study of FPGA cloud computing architectures is available in [79] and [80].

D. GENERAL-PURPOSE CLUSTERS
Overspecialized systems tend to constrain the potential of reconfigurable hardware in favor of optimizing performance or costs. Nevertheless, general-purpose clusters are addressed by a larger group of projects seeking to change the programming paradigm. These clusters, rather than being general purpose in the broad sense of the word, serve as experimental platforms to test solutions to all heterogeneous supercomputing challenges, ranging from the network to the user experience.

One of the first projects was the Reconfigurable Computing Cluster (RCC) [81] in the early 2000s. It was a multi-institution investigation project that explored the use of FPGAs to build cost-effective petascale computers, with its main contribution being the introduction of microbenchmarks for software, network performance, memory bandwidth, and power consumption. To evaluate each test, Spirit, a cluster consisting of 64 FPGA nodes, was built. Each node had a Virtex 4 FPGA with 2 Gigabit Ethernet ports, 8 DIMM slots for onboard RAM, and 8 MGTs for the board-to-board interconnection [82] using the Aurora protocol [83].

For internode communication, a configurable network layer core was developed as part of the Adaptable Computing Cluster project [84]. It consists of a network switch implemented in the FPGA acting as a concentrator for the router. Considering that the head node is a workstation, a message-passing interface (MPI) approach offered the flexibility that the cross-development environment required. A custom compiler based on GNU GCC was built to support OpenMPI and its Modular Component Architecture (MCA) [85], which was adapted to support the high-speed network. A software infrastructure based on a Linux system allowed users to access, manage, and configure all nodes of the cluster via SSH [86].
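Because these stacks expose the cluster through OpenMPI, host-side coordination can be expressed with ordinary MPI programs. The sketch below uses mpi4py as a stand-in for the MCA-extended MPI described above; the node-local "accelerator" call is a placeholder, not part of any cited toolchain.

```python
# Host-side coordination over MPI, in the spirit of the OpenMPI-based stacks used
# by RCC/Spirit and Axel: rank 0 scatters work descriptors and gathers results.
# The local accelerator call is a placeholder for an FPGA/GPU offload.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # One work descriptor per node; a real run would reference bitstreams,
    # buffers, or kernel arguments rather than plain integers.
    work = [{"job_id": i, "payload": list(range(i, i + 4))} for i in range(size)]
else:
    work = None

my_job = comm.scatter(work, root=0)

def run_on_local_accelerator(job):
    # Placeholder for the node-local offload (e.g., DMA to an FPGA and read back).
    return sum(job["payload"])

result = run_on_local_accelerator(my_job)
results = comm.gather({"job_id": my_job["job_id"], "value": result}, root=0)

if rank == 0:
    print(sorted(results, key=lambda r: r["job_id"]))
```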
Similar to the RCC project, the FPGA High-Performance Computing Alliance (FHPCA [87]) was established in 2005 with the Maxwell supercomputer [88]. The Maxwell CUs were built on a standard IBM BladeCenter chassis, in which an Intel Xeon and 2 FPGAs were interfaced via PCI-X. Additionally, an FPGA-dedicated network is available via MGTs without routing logic, given the nearest-neighbor scheme. By supporting standard parallel computing software, structures, and interfaces, it sought to disrupt the HPC space without causing significant friction.

To facilitate the development of applications targeting Maxwell, the Parallel Toolkit (PTK) [89] was developed. It included a set of practices and infrastructure to solve issues such as associating tasks with FPGA resources, segmenting the application into bitstreams, and managing code dependency. PTK provided a set of libraries where common standard interfaces, data structures, and components were defined.

Similarly, Cube was created to explore the scalability of a cost-effective massive FPGA experimentation cluster for real-world applications. It consisted of 8 boards that host a matrix of 8 by 8 Xilinx FPGAs [90], forming a cluster of 512 FPGAs, as shown in Figure 5. It features a single-configuration, multiple-data programming paradigm that allowed all FPGAs to be configured with the same bitstream in a matter of seconds. The FPGAs were interconnected in a systolic array that reached up to 3.2 Tb/s of inter-FPGA bandwidth, offering significant advantages as it simplified the programming model and greatly relaxed the requirements of the PCB layout.

Simultaneously, Quadro Plex (QP) [91], a hybrid cluster, was introduced. It was composed of 16 nodes, each consisting of one AMD CPU, 8 GB of RAM, 4 NVIDIA Quadro GPUs, and one Xilinx Virtex 4 Nallatech FPGA accelerator. The nodes were interconnected using Ethernet and InfiniBand. Cluster communication was managed using the OpenFabrics Enterprise Distribution software stack. The complete system occupied four 42U racks, consumed 18 kW, and had a theoretical performance of 23 TFLOPS. CUDA was used for GPU development, and the FPGA workflow completely relied on the Xilinx ISE design suite [92].

Several applications were developed, showing that there were substantial difficulties in taking advantage of the entire system. Applications would only use a combination of CPUs and GPUs or CPUs and FPGAs. A framework for easing the porting of applications and providing a compatibility layer for different accelerator workflows, called Phoenix [93], was developed.

In the same spirit, Axel [94] was built, consisting of 16 nodes. Each node had an AMD CPU, an NVIDIA Tesla GPU, and a Xilinx Virtex 5 FPGA, occupying a 4U full-scale rack. All CEs were connected to a common PCIe bus for intranodal communication and between nodes in a Gigabit Ethernet network. Considering the high latency and nondeterministic nature of Ethernet, a parallel network using the 4 MGTs of the FPGA was also available.

TABLE 4. Data center FPGA clusters’ contributions reported power and performance improvement.

FIGURE 5. Cube [90] computational unit (CU) showing the configuration controllers in purple. Dotted lines
show the control and configuration bus and solid lines show the data path.

The cluster was managed remotely from the central node using the Torque [95] resource manager and the Maui [96] scheduler. A custom resource manager (RM) was responsible for managing GPUs and FPGAs.


For this to be feasible, all Axel programs needed to allocate part of the resources in the CEs to interface with the RM runtime API. Using an IPC message queue framework, the CEs communicated their state to the head node. The central node collected information from all nodes with the help of the RM and prepared a script to submit the jobs to Torque. Communication between tasks in different nodes was performed via OpenMPI using Gigabit Ethernet.

To implement an application in Axel, users would provide a data flow graph and a hardware abstraction model. A MapReduce framework then rewrites the application, partitioning the analysis into tasks. These tasks are assigned to the corresponding CEs based on the targeted attributes.
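The attribute-driven assignment can be pictured as a simple matching step between the requirements declared by each task in the data flow graph and the capabilities advertised by each CE. The sketch below only illustrates that idea; it is not the Axel MapReduce framework, and the attribute names are invented.

```python
# Illustrative attribute matching in the style described for Axel: each task from
# the data flow graph declares required attributes, and the runtime assigns it to a
# computational element (CE) whose advertised attributes satisfy them.
CES = {
    "cpu0":  {"type": "CPU",  "attributes": {"control", "io"}},
    "gpu0":  {"type": "GPU",  "attributes": {"simd", "float"}},
    "fpga0": {"type": "FPGA", "attributes": {"stream", "bit-level", "low-latency"}},
}

TASKS = [
    {"name": "pre-process", "requires": {"io"}},
    {"name": "correlate",   "requires": {"stream", "low-latency"}},
    {"name": "score",       "requires": {"simd", "float"}},
]

def assign(tasks, ces):
    plan = {}
    for task in tasks:
        for name, ce in ces.items():
            if task["requires"] <= ce["attributes"]:
                plan[task["name"]] = name
                break
        else:
            raise RuntimeError(f"no CE satisfies {task['name']}")
    return plan

print(assign(TASKS, CES))   # {'pre-process': 'cpu0', 'correlate': 'fpga0', 'score': 'gpu0'}
```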
Axel also introduced an architecture classification for heterogeneous systems based on uniformity, shown in Figure 6. Following this classification, Axel is a Non-Uniform Node Uniform System (NNUS) architecture. This means that all nodes are equal but are built with different CEs. The advantage of this architecture is that the single-program multiple-data (SPMD) programming paradigm can be implemented easily. Axel also brought to light the need to reduce the FPGA design and implementation time, possibly by parallelizing the process so that heterogeneous clusters optimize their own executables. Furthermore, it showed that design exploration tools were also lacking and essential for automating performance estimation and code generation for multiple accelerators.

FIGURE 6. Axel [94] node or computational unit (CU) classification showing possible uniform and non-uniform node and system configurations for heterogeneous clusters of CPUs, GPUs, and FPGAs.

In 2010, Novo-G was presented as an experimental research cluster [97] consisting of 68 compute nodes built with COTS components. Its purpose was to help understand and advance the performance, productivity, and sustainability of future HPC systems and applications, focusing on the sustainability problem of current HPC systems using three different PCIe Intel FPGA boards: 24 nodes with 192 Stratix III FPGA boards, 12 nodes with 192 Stratix IV FPGA boards, and 32 nodes with 128 Stratix V FPGAs.

Novo-G has been used for several acceleration projects, ranging from biology to finance. One aspect all applications have in common is being embarrassingly parallel and, therefore, naturally scalable. All of these applications were developed using the software offered as part of the Novo-G platform, and the results showed an enormous speed-up compared to CPU clusters.

Chimera was the first work to focus on implementing an algorithmic FPGA and GPU pipeline. The Chimera cluster [17] was built using commercial components to explore alternative solutions to the computational constraints found in astronomy and to provide access to high-performance computing hardware for inexperienced users. The system is formed by CUs equipped with one CPU for management tasks, which is interfaced with 3 NVIDIA Tesla GPUs and 3 Altera Stratix IV FPGAs through PCIe via a backplane. Communication could always be considered a bottleneck, but in this case it is clear that this limitation is directly related to the algorithms implemented in the entire system and the way each CE interacts with the others.

The success of Novo-G and the advancement of technology have allowed Novo-G to be upgraded to Novo-G# [98]. The cluster is made up of Gidel ProceV accelerators that house Stratix V FPGAs, two 8 GB DDR3 modules, and 32 Mbits of SRAM memory. The boards were interconnected by grouping 24 transceivers into six groups to support a torus topology with a total bandwidth of 300 Gb/s. The physical connection is done with fiber optics using QSFP+ modules. The data are transmitted via packets through a configurable single-level router network. This allows one to instantiate as many routers as necessary to service the ports and increase the internal bandwidth at the expense of hardware resources. The network flexibility enables users to experiment with a variety of routing modalities, depending on the requirements of the application. Novo-G# nodes support three communication blocks: a Low Latency block, a Custom block, and Interlaken [99], to allow the optimization of the physical layer depending on the application.

A common problem in custom computing is the lack of software development tools to help users build applications. To solve this problem, the Novo-G# team developed a modified Altera OpenCL to provide extended support for the 3D torus network present in the cluster.
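The routing flexibility rests on a regular 3D torus address space in which every node has six direct neighbors, obtained by wrapping each coordinate. The short sketch below computes those neighbors; the 4 × 4 × 4 shape is only an example and is not claimed to be the actual Novo-G# machine size.

```python
# Neighbor computation on a 3D torus, the topology used by the Novo-G# FPGA
# network: each node (x, y, z) has six neighbors, with coordinates wrapping at
# the edges. The 4x4x4 shape below is only an example.
from itertools import product

def torus_neighbors(node, shape):
    x, y, z = node
    sx, sy, sz = shape
    return [
        ((x + dx) % sx, (y + dy) % sy, (z + dz) % sz)
        for dx, dy, dz in [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    ]

shape = (4, 4, 4)
assert all(len(set(torus_neighbors(n, shape))) == 6 for n in product(range(4), repeat=3))
print(torus_neighbors((0, 0, 3), shape))   # includes wrap-around links such as (0, 0, 0)
```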


An important aspect that most clusters left out, besides those focused on communication, was the interface with the physical world. This is the empty space that the Axiom platform [100] seeks to fill with a custom scalable cluster based on a board with a Xilinx MPSoC (Multiprocessor System-on-Chip) supporting the Arduino interface.

The MPSoC has an FPGA fabric, four 64-bit ARM cores for general-purpose applications, and two 32-bit ARM cores for real-time applications on the same die. Four USB-C ports managed by the FPGA MGTs are available for interconnecting the boards. A custom network interface (NI) in the FPGA provides support for all communications, allowing users to focus on their applications, written in an OpenMP extension called OmpSs. The NI is divided into six main groups: a data mover that deals with DMA transfers, RX and TX controllers, and FIFOs to cache packets. A router is interfaced with each NI and is responsible for handling the USB-C channels, monitoring the network, and establishing virtual circuits.

As part of the Axiom project, a custom software stack [101] consisting of multiple layers was also developed. Its foundation is a distributed shared-memory (DSM) architecture. The main advantage of this approach is that it allows applications to directly address physical memory by transparently relying on an OS network. Several tests [102] and benchmarks have validated the effectiveness of the platform, pushing the project forward into IoT and edge computing [103].

Progress in this field has led to the creation of the Xilinx Adaptive Compute Clusters (XACC) [104] group under the Xilinx Heterogeneous Accelerated Compute Clusters (HACC) [105] initiative. This industry and academic collaboration focuses on the development of new architectures, tools, and applications for next-generation computers.

As part of this initiative, several clusters were built at some of the world's most prestigious universities in Switzerland, the USA, Germany, and Singapore. At Paderborn University's National High-Performance Computing Center (PC2), the high-performance clusters Noctua [106] and Noctua 2 [107] were built to provide hardware to accelerate research on computing systems with high energy efficiency.

The Noctua 2 cluster was designed to fit common server racks and to be compatible with network industry standards. It has 36 nodes, each with 2 AMD Milan CPUs. A combination of 48 Xilinx Alveo and 32 Intel Stratix 10 GX FPGAs comprises the reconfigurable computing part of the cluster. Each Stratix node has 4 pluggable QSFP+ links at 40 Gb/s, and each Alveo has 2 QSFP+ links at 100 Gb/s; the Intel FPGAs depend on Intel tools such as oneAPI [108], OpenCL, and DSP Builder. A specific optical switch is used to build a configurable point-to-point network between all FPGAs.

More recently, Enzian [109] was developed as a scalable platform to fill the void left by industry-specific hybrid platforms. The reasoning behind Enzian is to provide a general, open, and affordable platform for research on hybrid CPU-FPGA computing, escaping the niche of special-purpose hybrid platforms by providing a lot of flexibility. Explicit access to coherence messages, thermal and power monitoring, and an open baseboard management controller (BMC) allows for research that is not possible in any current commercial system.

Likewise, UNILOGIC [110] presented a new approach, this time from the management side of the cluster, by introducing a Partitioned Global Address Space (PGAS) parallel model to heterogeneous computing. This allows hardware accelerators to directly access any memory location in the system, and locality makes coherency techniques unnecessary, greatly simplifying communication. By integrating Dynamic Partial Reconfiguration (DPR) into the framework, accelerators can be installed on the go. The UNILOGIC architecture was evaluated on a custom prototype consisting of 8 interconnected daughter boards, each with four Xilinx Zynq UltraScale+ MPSoCs and 64 Gigabytes of DDR4 memory, yielding better energy and computing efficiency than conventional GPU or CPU parallel platforms.
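A partitioned global address space of this kind boils down to a fixed mapping between one flat address and a (home node, local offset) pair, which is what lets an accelerator issue loads and stores to remote memory directly. The sketch below shows that arithmetic under an assumed partition size; it is an illustration, not the UNILOGIC implementation.

```python
# Minimal PGAS-style address translation: a single global address space is split
# into fixed-size partitions, one per node, so any CE can turn a global address
# into (home node, local offset) without coherence traffic. The 64 GiB partition
# size echoes the per-board DDR4 mentioned above but is otherwise arbitrary.
PARTITION_BYTES = 64 * 2**30          # assumed size of each node's window
NUM_NODES = 8

def to_local(global_addr: int) -> tuple[int, int]:
    node = global_addr // PARTITION_BYTES
    if node >= NUM_NODES:
        raise ValueError("address outside the global space")
    return node, global_addr % PARTITION_BYTES

def to_global(node: int, offset: int) -> int:
    return node * PARTITION_BYTES + offset

addr = to_global(node=5, offset=0x1000)
assert to_local(addr) == (5, 0x1000)
```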
In 2022, the supercomputer Cygnus [111] was updated [112] to follow a multi-hybrid accelerator approach based on GPUs and FPGAs. 32 Albireo nodes were added to Cygnus, each consisting of 4 NVIDIA V100 GPUs and two Intel Stratix 10 FPGAs. Similar to previous systems, a dedicated FPGA network was created with a 2D torus topology and improved stream capabilities, called CIRCUS [113]. Collaboration between the FPGAs and GPUs is achieved by using a DMA engine in the FPGA that accesses the GPU directly, bypassing the CPU and offering almost double the throughput.

Finally, Fugaku [114], the first supercomputer to win all four categories in the Top500, presented a prototype FPGA cluster, ESSPER [115]. Motivated by the impressive continuous improvements in FPGAs regarding energy and performance, a cluster of 8 nodes, each with two Intel Stratix 10 FPGAs, was built and tested. This cluster was interfaced with Fugaku using a novel approach called loosely coupled, where a host-FPGA bridging network provides interoperability and flexibility to all nodes in Fugaku.

E. COMMUNICATION SYSTEMS INFRASTRUCTURE
Another field of application where clusters of FPGAs are relevant is the emulation of communication system infrastructure. The most important difference with manycore emulation is the need to interface with analog systems. This requirement implies providing additional external ports to interface with radio front-ends.

One of the first implementations was the Berkeley Emulation Engine (BEE) [117] in 2003. Its main purpose was to support design space exploration for real-time algorithms, focusing mainly on data-flow architectures for digital signal processing.

BEE was designed to emulate the digital part of telecommunication systems and to provide a flexible interface for radio front-ends. Computations are performed inside BEE Processing Units (BPU).


Each BPU has a main processing board (MPB) and 8 riser I/O cards for 2400 external signals. The MPBs are the main computing boards, hosting 20 Xilinx Virtex FPGAs, 16 zero-bus-turnaround (ZBT) SRAMs, and 8 high-speed connectors. FPGAs on the periphery of the board have off-board connectors to link other MPBs. A hybrid network consisting of a combination of a mesh network and a partial crossbar, called a hybrid complete-graph and partial crossbar (HCGP) [118], was implemented. A single-board computer (SBC) running Apache web services over Linux allows users to deploy their applications and perform configuration and slow-control tasks.

To take full advantage of the platform, an automated high-level workflow was used [119] that relied on MATLAB and Simulink to develop the main hardware blocks. The BEE compiler then processes the output and generates the required VHDL files for the simulation and configuration of the system. A time-division multiple access (TDMA) receiver was fully implemented to satisfy real-time requirements and validate the workflow.

Following the BEE success, the BEE2 [120] was conceived as a universal, standard reconfigurable computing system consisting of 5 Virtex 2 FPGAs, each with 4 DIMM connectors for up to 4 GB of RAM. Four FPGAs are available for computing, and one was reserved for control tasks. Pivoting away from the HCGP, an onboard mesh was implemented between the 4 computing FPGAs. Using high-speed links, it was possible to aggregate the 5 FPGAs and use them as a single, larger FPGA. The workflow remained almost the same for BEE2, with the main change being the use of a computational model of synchronous data flow for both the microprocessor and the FPGA.

To overcome the shortcomings of BEE2 and take advantage of the already validated Spirit architecture [121], a digital wireless channel emulator (DWCE) [122], [123] was developed. It consisted of 64 nodes in the same way as Spirit, but with valuable upgrades to demonstrate the capabilities of FPGA clusters with military radios. Its capabilities improved with an upgraded FPGA, 2 additional FMC connectors, and the adoption of a standard MicroTCA.4 form factor.

Considering the possible improvements to BEE, and because it was being developed as part of the research accelerator for multiple processors (RAMP) community [124], a fast response was presented in the form of BEE3 [116]. The development of BEE3 differed from previous iterations and successfully demonstrated a new collaboration methodology between industry and academia [125].

The architecture of BEE3 changed substantially from that of its predecessor by removing the control FPGA and introducing a control module on a smaller PCB. Another important aspect worth highlighting is that, for the first time, a PCB was intentionally developed to support different FPGA parts, all interconnected using a DDR2 interface in a ring topology. The BEE3 prototype had approximately 30 collaborators, most of whom were professionals with extensive knowledge of CAD. Relying on industry specialists for PCB design resulted in simpler and more reliable PCBs within a shorter project time horizon. In addition, it was possible to parallelize the design process, allowing the academic community to focus on firmware development.

The BEE collaboration presented its final iteration in 2010, consisting of BEE4 and miniBEE [126], [127]. BEE4 was updated to support Virtex 6 FPGAs and up to 128 GB of DDR3 RAM per module. The QSHs were removed in favor of FMC connectors to support a wider range of mezzanine boards. BEE4 was built around the Honeycomb architecture using the Sting I/O intermodule communication protocol. The design tools were further refined to include Nectar OS and the BEEcube Platform Studio in MATLAB/Simulink, which are unfortunately proprietary. However, being a proprietary system did not discourage its use in academia [128]. The success of BEEcube attracted further interest from industry, and it was bought by National Instruments in 2015 [129]. Today, it is part of the FlexRIO [130] line-up, and software development is supported by NI tools. From this point onward, almost all implementations depend on commercially available emulation platforms.

To demonstrate the scalability of such implementations, the world's largest wireless network emulator was built, Colosseum [131], which can compute workloads of 820 Gb/s and perform 210 tera-operations per second. It was formed by CUs that consisted of three FPGAs in a chain. The outer FPGAs were used to interface with the radios and provide some processing. The central FPGA is dedicated to digital signal processing. Commercially available solutions were selected to avoid the complications of designing a custom board. For the radio-attached FPGAs, 128 USRP-X312 [132] software-defined radios were used. Each provides the analog interfaces required for the antennas, along with a Kintex 7 FPGA. As dedicated processing FPGAs, 16 NI ATCA-3671 [133] modules were used, each hosting 4 FPGAs. The 64 processing FPGAs were interconnected in a 4 × 4 × 4 HyperX topology [134], which allowed the data to be efficiently distributed for processing.

The NI modules are based on the BEE architecture and support the same development tools. Given the complexity of the system, a Python data-flow emulator [135] was built to confirm the topology and architecture of the system. It is possible to confirm the latency of the system by providing models of the implemented components and topology.
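A data-flow emulator of this kind can be kept very small: each component is modeled by a latency, and the end-to-end latency of a path through the topology is the sum along the chain. The toy sketch below only illustrates that idea; it is not the Colosseum emulator, and all latency figures are made up.

```python
# Toy data-flow latency model in the spirit of the Python emulator described for
# Colosseum: components and links are annotated with latencies, and a path through
# the topology is checked against a real-time budget. All numbers are illustrative.
COMPONENT_LATENCY_US = {
    "radio_frontend": 5.0,
    "edge_fpga":      8.0,     # radio-attached FPGA doing partial processing
    "hyperx_hop":     1.5,     # one hop in the processing-FPGA interconnect
    "dsp_fpga":      12.0,     # central FPGA dedicated to channel emulation
}

def path_latency(path):
    return sum(COMPONENT_LATENCY_US[stage] for stage in path)

path = ["radio_frontend", "edge_fpga", "hyperx_hop", "hyperx_hop", "dsp_fpga",
        "hyperx_hop", "edge_fpga", "radio_frontend"]

budget_us = 50.0
latency = path_latency(path)
print(f"round-trip latency {latency:.1f} us, within budget: {latency <= budget_us}")
```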
Another notable contribution of this study is the proposal of a data flow methodology [136]. It comprises three guiding principles that highlight the issues present in other implementations. The first principle is the use of a unified interface for modular components to favor portability. Second, when dealing with heterogeneous systems, the suggested approach is asynchronous processing, to decouple operations from time and favor parallelization. Finally, based on design best practices, solutions are urged to be vendor-independent.


TABLE 5. General-purpose clusters’ contributions, reported power and performance gains.


TABLE 6. Communication systems emulation clusters’ contributions, reported power and performance gains.

II. CLASSIFICATION
After studying each of the works described above, it was possible to identify common elements. These elements reflect the decisions made by the designers when conceiving each cluster. Given that heterogeneous computing is broad and complex, until now no universal methodology has been developed to design a cluster. A classification system was proposed in [94] based on the uniformity of the system and its nodes.

We proposed segmenting the cluster infrastructure into three main components, as shown in Figure 7. The first aspect is the network. This covers the physical interfaces chosen to connect the nodes, the logic protocols, and the topologies. Another important aspect to consider is the hardware available in the CU. Each CU can have more than one CE type. Finally, software tools that allow the cluster to be securely available to users for development were considered. They encompass isolation tools that protect hardware from misbehavior, which are discussed in [137], [138], but go beyond programming languages, APIs, libraries, etc. All the cited works provide an overview of the tools available to the user and the intended workflow. Depending on the target application, these tools vary in scope, flexibility, and complexity. With a study of all previous contributions, it is possible to build a wide base that helps understand the greatest challenges, future trends, and real capabilities of heterogeneous supercomputing.

FIGURE 7. Main concepts for the proposed classification of clusters.
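In code terms, the proposed segmentation amounts to recording three groups of attributes per cluster. A minimal illustrative schema is sketched below; the field choices are ours and do not constitute a formal taxonomy or the layout of the comparison tables.

```python
# Illustrative record for the three-way segmentation used in this survey: each
# cluster is described by its network, its hardware (CEs per CU), and the software
# tools exposed to developers. Field choices are illustrative only.
from dataclasses import dataclass, field

@dataclass
class NetworkInfo:
    interconnection: str          # "direct" or "indirect"
    topology: str                 # e.g. "2D torus", "mesh", "switched"
    link: str                     # e.g. "MGT/SATA", "QSFP+", "Gigabit Ethernet"

@dataclass
class ClusterRecord:
    name: str
    network: NetworkInfo
    ces_per_cu: dict = field(default_factory=dict)     # e.g. {"CPU": 1, "FPGA": 2}
    software_tools: list = field(default_factory=list)

example = ClusterRecord(
    name="example-cluster",
    network=NetworkInfo("direct", "2D torus", "QSFP+ optical links"),
    ces_per_cu={"CPU": 1, "FPGA": 2},
    software_tools=["OpenCL", "MPI"],
)
print(example)
```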

A. NETWORK
A cluster is no more than a set of computational elements (CE) that collaborate toward a common goal. The collaboration method and its means are crucial for ensuring implementation efficiency. The means of collaboration branch out from the hardware interfaces to the communication protocols and, ultimately, the schedulers or other methods of synchronization. With this consideration, we can draw a line between systems that delegate communication tasks to an external entity and those that incorporate the stack. Another important aspect of the interconnection is how it is handled. In high-speed stream computing, it is desirable that the communication be established as direct data channels with back-pressure, and this particular aspect is difficult to replicate with purely routed networks.

Table 7 shows several aspects that distinguish the implementations concerning network infrastructure for all the works presented in Section I. The manner in which nodes are connected is discriminated according to the existence of any additional hardware that processes, redirects, or interprets a stream of data or packets between adjacent nodes, which counts as an indirect interconnection.


TABLE 7. Network infrastructure.

Table 7 shows several aspects that distinguish the implementations concerning network infrastructure for all the works presented in Section I. The manner in which nodes are connected is discriminated according to the existence of any additional hardware that processes, redirects, or interprets a stream of data or packages between adjacent nodes as an indirect interconnection. This implies that a direct interconnection is one in which a node can interact directly with its nearest neighbor without the need for additional networking hardware, excluding physical interfaces. In these implementations, network services are provided by in-fabric routers and switches, which allow users to experiment with different protocols at the expense of resources. This is particularly crucial in implementations targeting heavy communication problems that require low latency. By no means does the dependence on external hardware impose a disadvantage: recent implementations show that it is capable of extending scaling capabilities without affecting performance, as in the case of [25], which effectively interconnects thousands of CEs. As shown in [141], adding dedicated network hardware increases the latency by a constant factor. To determine the impact on performance, a ring topology was implemented using the E40G protocol. The experiments showed that the performance depended on the size of the packet. For smaller packets, the latency was dominant, tipping the scale in favor of direct interconnection, but for larger packets (> 1 MB), the switched implementation offered an improvement of approximately 5%.

To compare the impact of both network connections, a leaf-spine topology was implemented for the switched network, and a ring topology for direct interconnection. The switched network was modeled for 2048 FPGAs using 64 radix switches. The ring topology was simulated for a direct network. These simulations showed that for a small message size (≈ < 1 MB), a direct network offers a shorter transmission time than a switched network, regardless of the number of nodes.


In contrast, larger payloads (> 227 MB) benefit from a switched network, but only up to 1024 nodes, when the direct network transmission time catches up. These results show that one approach is not necessarily better, but that it comes down to the specificity of each cluster.
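To make the packet-size dependence concrete, the following minimal sketch models the end-to-end time of a single message as a fixed per-hop latency plus a serialization delay. The hop counts, per-hop latencies, and the 40 Gb/s link bandwidth are illustrative assumptions rather than values taken from the cited experiments; the sketch only shows why latency dominates small transfers while the serialization term levels the field for large ones.

```python
# First-order model of point-to-point message time in a cluster.
# All numeric parameters below are illustrative assumptions, not values
# measured in the works cited above.

def message_time(payload_bytes, hops, per_hop_latency_s, link_bw_bytes_s,
                 store_and_forward=False):
    """End-to-end time of one message.

    Cut-through routing pays the serialization delay once; store-and-forward
    pays it again at every hop.
    """
    serialization = payload_bytes / link_bw_bytes_s
    if store_and_forward:
        return hops * (per_hop_latency_s + serialization)
    return hops * per_hop_latency_s + serialization

LINK_BW = 40e9 / 8  # assumed 40 Gb/s links, expressed in bytes per second

for size in (4 * 1024, 1 * 2**20, 128 * 2**20):  # 4 KiB, 1 MiB, 128 MiB
    # Neighbour-to-neighbour transfer over one direct link.
    direct = message_time(size, hops=1, per_hop_latency_s=200e-9,
                          link_bw_bytes_s=LINK_BW)
    # Leaf-spine path: modelled as four hops with a higher (1 us) per-hop latency
    # to account for the switch traversals.
    switched = message_time(size, hops=4, per_hop_latency_s=1e-6,
                            link_bw_bytes_s=LINK_BW)
    print(f"{size / 2**20:9.3f} MiB: direct {direct * 1e6:10.2f} us, "
          f"switched {switched * 1e6:10.2f} us")
```

Under these assumed numbers, the constant per-hop term dominates kilobyte-sized messages, whereas for payloads of hundreds of megabytes the serialization delay makes the two paths nearly indistinguishable, which is consistent with the packet-size dependence reported above.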
Similarly, the topology of the clusters is a decisive design factor. Given that nodes are desired to have the highest throughput with the lowest latency, most researchers have opted for a tightly interconnected topology such as a mesh or two-level mesh (TLM). These are great at providing a consistent distance between the nodes in the system, but they strongly affect larger systems. Another popular topology is the torus, either 2D or 3D. It has the advantage of limiting the longest distance between nodes; however, as mentioned before, this distance continues to grow as more nodes are added to the system. The system should also keep a uniform shape, and nodes should be added to fill columns or rows to avoid introducing inconsistencies in latency. Later works, such as Noctua and Enzian, are not bound by a fixed topology. In particular, Noctua's infrastructure provides an optical switch capable of implementing different topologies at runtime based on user requests. Naturally, given that this is an external device, it can be implemented in any indirectly connected cluster. For some of the directly interconnected clusters, it may be possible to add an external switch, but only if a standard protocol and interface are used, as is the case for Novo-G and Novo-G#.
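The growth of the worst-case distance mentioned above can be illustrated with a short back-of-the-envelope sketch. The formulas assume ideal square or cubic layouts with shortest-path routing; they are not derived from any of the surveyed clusters and are only meant to show how the network diameter scales as nodes are added.

```python
# Back-of-the-envelope network diameter (worst-case hop count) for a few
# regular topologies, assuming ideal square/cubic layouts with shortest-path
# routing. This is an illustration of scaling behaviour, not a model of any
# specific cluster discussed in the text.

def ring_diameter(nodes: int) -> int:
    return nodes // 2

def mesh_diameter(nodes: int, dims: int = 2) -> int:
    side = round(nodes ** (1 / dims))  # nodes per dimension
    return dims * (side - 1)           # corner-to-corner path

def torus_diameter(nodes: int, dims: int = 2) -> int:
    side = round(nodes ** (1 / dims))
    return dims * (side // 2)          # wrap-around links halve each dimension

for n in (64, 729, 4096):              # chosen to be perfect squares and cubes
    print(f"{n:5d} nodes: ring {ring_diameter(n):4d}, "
          f"2D mesh {mesh_diameter(n, 2):3d}, 2D torus {torus_diameter(n, 2):3d}, "
          f"3D torus {torus_diameter(n, 3):2d}")
```

The numbers reflect the qualitative point above: wrap-around links roughly halve the mesh distance and a third dimension helps further, yet the diameter of any fixed direct topology keeps growing with the node count, whereas an external or optically reconfigurable switch can keep the hop count roughly constant.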
The strong relationship between the interface and the protocol is hard to break, and it is rare to find a reason good enough to do so. For the same reason, most MGT implementations are based on the Aurora protocol or similar for the physical layer, while the data link relies on Ethernet. However, Bluehive challenges this reasoning by implementing eSATA over PCIe connectors, because PCIe was the only high-speed connector available on the FPGA nodes. Other particular design choices include the implementation of DDR over GPIOs for same-board communications; this is the case for BEE, BEE2, and BEE3. It can be appreciated that, considering the complexity of bringing up one of these systems, standard protocols and interfaces have been favored. This stems from the fact that proven technologies shorten design times, allowing developers to focus on other issues.

B. HARDWARE
When studying a cluster's hardware, it is helpful to divide it into its computational units (CU). A CU is any entity that is available for computing and is the smallest independent functional part of the cluster. According to this definition, devices that act as pure network appliances, routers, or switches are not considered. A CU can be composed of multiple CEs. Over time, smaller CUs have been preferred when dealing with general-purpose clusters, whereas specific problems can benefit from larger CUs with an ad hoc network topology.

In the context of Axel [94], a classification structure was proposed. It focuses on identifying the nodes, which in this paper we refer to as CUs to avoid confusion with network nodes, by their CEs and the way in which they are distributed in the system. Four different types were considered, as shown in Figure 6. The Uniform-Node Uniform-System (UNUS) corresponds to a homogeneous cluster and is typically formed by CPUs. However, we can also find FPGA implementations such as Formic [39], Janus I [51], and Janus II [63]. This approach has the advantage that a single programming model can be applied to the entire system, but it is not restricted by it: given that an FPGA is by nature a heterogeneous device, this is not the case. Uniformity significantly simplifies the management and maintenance of clusters. Interestingly, these advantages have not critically outweighed the disadvantages, given that several other studies have explored other, more complex approaches. Performance-wise, there is a lot to win when dealing with non-uniform nodes or systems. As shown in Table 1, all the computing problems can be classified into 13 categories. By carefully studying the affinity between each category of problems and the resources encapsulated in each CE, ideal candidates to solve each problem can be found. This means that, by mixing and matching, a heterogeneous cluster may offer performance advantages over a homogeneous cluster.

FIGURE 8. 13 dwarfs (Table 1) mapped out according to the highest-affinity computational element (CE); the superscript ˆ refers to floating point and ∗ to fixed point [17].

Figure 8 shows how each of the problems is mapped to CEs depending on their characteristics [17]. This map shows that there are clear advantages in using one type of CE over another, mainly between GPU and FPGA. This is further supported by the implementations of Chimera [17] and Green Flash [140]. In both cases, GPUs and FPGAs are intended to be used as collaborative CEs. However, this new paradigm makes development much more complicated, not only because of the lack of tools for FPGA and GPU co-processing, but also because of the required radical change of mindset when leaving the traditional CPU-plus-accelerator context. This is reflected in the reported applications of QP [91], in which users used only a combination of CPU plus GPU or FPGA.


TABLE 8. Hardware architecture of computation units (CU) with respect to their computational elements (CE).

Table 8 shows the most relevant studies classified according to the characteristics of their CU. The Axel classification system was used to identify the uniformity of the node (CU) and system. The total number of CUs is also presented. Some studies have presented the architecture of a single node as a building block for a future cluster. These are considered relevant for their contribution to the study of heterogeneous workloads. The form factor of each work is shown in the respective column. Given that these are heterogeneous systems, different types of CE may be present at CUs, and sometimes even among CUs. Finally, Table 8 shows the total number of CEs implemented in the system.

As previously described, classification based on node (CU) and system uniformity is useful for understanding the programming paradigm. According to the previous definition of nodes, multiple CEs can be hosted on a single node. The balance between a crowded and a simple node resides in the diversity of the CEs and the network infrastructure. Diverse CEs in a single node allow for the highest resource availability per CU for developers, but it remains a challenge to interface all devices considering all the different ports. This leads to different form factors that directly impact the way the cluster scales and, more importantly, the availability of physical structures to hold the nodes in place and provide efficient cooling. Custom CU form factors usually host several CEs and can compromise not only scalability but also fault tolerance. This is the case for ARUZ [25], where a single node hosts up to 11 FPGAs. In the unfortunate case where one or more CEs break down, the OS must be notified to circumvent these nodes or completely ignore them until they are fixed. In this regard, COTS clusters have a great advantage: the bring-up cost is mostly absorbed by the industry by providing tested and validated nodes for quick installation, which is the case for Noctua 2 [107] and Catapult [22], among others. Some rare cases of industry and academia collaboration greatly benefit from COTS advantages with specific research-motivated modifications, as in Novo-G [97] and BEE3 [125].

C. SOFTWARE TOOLS
Finally, each work discussed would be incomplete if there were no tools available to help users develop their applications. These tools provide different layers of isolation, ranging from templates that encapsulate internode communication to complete operating systems that manage multiple-user access. Each of the tools offers a degree of abstraction encapsulating all underlying details to offer services to the user or to a higher layer. The depth of the layer stack depends on several factors:
• Purpose of the cluster
• Degree of freedom intended for the user (isolation)
• Cluster flexibility

A stack of tools can be structured according to the services provided and required, as shown in Figure 9. First, we have the interface with the external world at a physical level. Naturally, we rely on electric signals controlled by internal gates, GPIOs, or MGTs. Typically, in HPC, this is not up for discussion, given that CPUs and GPUs have fixed interfaces, but FPGAs are not bounded by this. Thus, the communication layer can be either available for users to freely customize and test or fixed by the developers and provided as a service. In addition, we have the actual CEs; these can be CPU, GPU, or FPGA custom cores. It is in this part where the actual computing is performed, and users may be able to define the entities, or developers may provide programmable blocks. To interact safely with these block drivers, a file system and a scheduler may be provided as an operating system. This creates a safe space for users to build applications based on the hardware and communication services. At this level, users must rely on a programming language that describes how the underlying parts cooperate for the intended computation. Some studies have presented new programming languages that aim to capture the different programming paradigms in heterogeneous clusters. Tools that take the abstract description of the computation and transform it into instructions may be provided as a contained solution or along with libraries and APIs to facilitate development. Another level of abstraction may be introduced, in which users interact with prebuilt blocks in a fixed context inside a GUI.

TABLE 9. Target application and development tools.

Table 9 shows a series of works with the development tools provided and the intended application. Depending on the scope of the application, user needs vary and may require deeper access to the system or more abstract tools. Most general-purpose clusters are intended to be used as research platforms. This requirement relaxes many management applications and abstraction layers that, in turn, must be provided to the user in other cases. As research could take place at the lowest level of communication, users may need the freedom to change the electrical standard of the GPIOs or the encoding of the MGTs. These properties are only available if the user sees the platform as a bare-metal solution, or if the development environment has a standardized way of defining communication devices. In any case, most systems avoid this by providing the user with a template that abstracts the communication layer. Specific application clusters seek to encapsulate most of the details such that the user faces only the challenges related to the application.

The flexibility of the software stack also depends on the platform's openness. In this regard, FPGA development frameworks have lagged significantly behind those for CPUs and GPUs. Currently, one can use complete open-source frameworks to develop applications for CPUs and GPUs, but FPGAs are radically different. One reason for this is that the stack of tools is fundamentally different. Instead of targeting fixed hardware through a well-known and well-defined instruction set architecture (ISA), FPGA tools target configuration memory with architecture-specific information. These architectural details tend to be industry secrets that force developers to rely on vendor tools with all their benefits and limitations. One of the most important limitations is the proprietary nature of some vendor tools.


Efforts have been made to create completely open-source workflows, such as F4PGA [8], in which experienced users can actively collaborate to improve the platform. Finally, an important aspect directly tied to the application is the level of flexibility provided by the cluster. Some applications can implement external hardware for optional data streams; this is the case for all the BEE implementations. Other domains of flexibility include the network topology and the communication protocol of the cluster. The portability of the framework was described by the flexibility of the CEs. This means that the cluster CEs could potentially be updated or changed without requiring important modifications of the development tools, future-proofing them and providing customization depending on the user's needs.

III. OPEN PROBLEMS AND TRENDS
Supercomputing is a complex and fast-evolving field in which CPUs and GPUs have traditionally dominated the market. Several successful attempts have been made to introduce FPGAs in this context, such as the F1 instances in Amazon and the IBM cloud FPGA service. The flexibility and energy efficiency of FPGAs strongly challenge CPUs and GPUs for the same computing tasks, further motivating research in this area.


The opportunities that FPGAs offer to heterogeneous computing are huge. As already shown in several studies [14], [142], [143], [144], [145], FPGAs can surpass CPU energy efficiency by orders of magnitude by relying on hardware-level customization. FPGAs also offer the highest degree of control to developers, allowing optimization at a logic level that is impossible with CPUs and GPUs. In addition, modern SoC-FPGAs offer internal high-speed connections that allow CPUs, GPUs, and FPGAs to interact on the same die, thereby reducing communication latency and power consumption.

FIGURE 9. Tool stack divided into different levels depending on the user isolation from the hardware.

However, as has been shown in this paper, having hardware does not mean that the problems are solved. One of the biggest obstacles to mass adoption is the lack of hardware abstraction and efficient synthesis tools, which increases development time [14], [144] compared to CPUs and GPUs. To identify the specific challenges, we divided the implementation of a cluster into three areas: network, hardware, and software tools. For each area, we performed a study to recognize the trends and obstacles.

The network aspect of clusters has rapidly evolved and is mainly driven by telecommunications. Faster and more efficient communication platforms are always positive, but their implementation in heterogeneous computing has not been obvious, given the existing trade-offs. This part is often fixed in the design process, and its drawbacks are widespread throughout the service stack. From the studied implementations, the following trends and trade-offs were identified:
• Interface: A standard interface such as MGTs greatly reduces development effort at the cost of reduced flexibility. Alternatively, SoC-FPGAs offer numerous GPIOs that can be used at hundreds of megahertz, catching up with the throughput of MGTs. The flexibility offered by GPIOs allows the development of custom protocols, such as the time-division multiplexed (TDM) scheme proposed in [146]. Development time has pushed designers to embrace well-defined interfaces, which are usually constrained by the selected vendor, complicating porting. However, the need for portable interfaces and systems is addressed in [147] by introducing Kyokko, a vendor-independent MGT controller.
• Topology: This aspect directly impacts the maximum transmission rate and data throughput in the cluster. In addition, it offers flexibility. Depending on the interface, the topology can be modified at runtime, as in [107], or fixed, as in [64]. For a fixed topology, the 3D torus allows the best physical interconnection at the expense of non-uniform latency if the scaling is not symmetric, as shown in [146]. The alternative of a virtual circuit network over indirect connections shows promise, as shown by VCSN [148], which allows a flexible virtual topology with a performance similar to or better than that of directly connected networks.
• Protocol: As shown in Table 7, custom protocols remain relevant, suggesting that no industry standard completely satisfies the requirements of heterogeneous computing. This is partially owing to the flexibility of FPGAs, which allows developers to optimize the protocol for latency, as shown in [141]. The drawback is that a more flexible protocol will have a greater complexity, impacting routers, decoders, encoders, and ultimately the network's throughput and latency.

The hardware that constitutes a cluster is another point of discussion. This is closely related to the network, given that the possible interfaces, topologies, and protocols are constrained by the selected hardware. The contribution of Chimera [17], which defined the ideal hardware to defeat each of the 13 dwarfs [16], confirms the advantage of heterogeneous nodes, particularly FPGAs, with the inclusion of CPUs or GPUs. This points directly to the SoC-FPGA, which in most cases includes all CEs in a single chip. The main trends identified correspond to an increasing preference for:
• CUs with fewer CEs
• SoC-FPGAs as the heterogeneous part of the system
• A standard CU form factor
Having fewer CEs reduces the CU cost, which is important for the scalability and maintenance of the cluster. A standard form factor facilitates integration in current supercomputing centers by relying on common structural and thermal solutions.

‘‘A tool is only as good as the hands that wield it,’’ is a common saying. In this case, quite often the tool lacks a handle from which to wield it. Software tools are crucial for the usability of any computer, particularly for heterogeneous systems that present a new paradigm that makes development much more complicated. This has been the Achilles heel of most implementations, and it is one of the reasons that general adoption has yet to occur. We recognize that there are some missing pieces that represent open challenges:
• Hardware abstraction models for different heterogeneous platforms.
• Standard interfaces for portability.
• Flexible design tools to optimize implementations targeting heterogeneous clusters.
• Open-source tools for community-driven development to further accelerate adoption.
• Operating systems for cluster management.


• Runtime performance analysis tools to identify bottlenecks.

In 2008, the Strategic Infrastructure for Reconfigurable Computing Applications (SIRCA) [149] provided a comprehensive study of the tools required for the adoption of mainstream reconfigurable computing. This study separated the tools based on four relevant phases: formulation, design, translation, and execution.

The initial phase, in which the algorithms are elaborated and optimized for parallel computing, is referred to as formulation. This is the highest level of abstraction, mostly dealing with pseudo-code and verbal language for reasoning. SIRCA highlighted the need for tools that aid developers in making strategic decisions that favor the parallel model embedded in heterogeneous computing rather than leaving the decisions to the later phases. Formulation is the most critical step, in which researchers can benefit the most from insight into the paradigm present in the targeted heterogeneous system. Tools that provide strategic exploration, high-level prediction, and numerical analysis have a strongly positive impact on the other phases.

The design phase consists of the languages used to translate an algorithm into a behavioral implementation. This field has been broadened by the creation and adaptation of modern HDL languages, such as Chisel [150], [151], based on Scala, and Clash [152], [153], based on Haskell, and by high-level synthesis tools, such as BondMachine, based on Go [154], among several others. New developments have solved, to some degree, the issues of portability and interoperability by raising abstraction. However, the method for scaling designs to heterogeneous clusters remains platform-specific. Without these facilities, users are expected to be responsible for porting and partitioning the design. Furthermore, users are tasked with specifying the concurrency model at the system level, which is a difficult task. An in-depth study of design tools, frameworks, and strategies for design space exploration can be found in [155].

Once a PC-compatible description of the algorithm is available, the next phase maps it to the actual physical resources of the system. This phase is known as translation or place-and-route (PAR). Several improvements were made in recent years [156], [157]. Most focus on speeding up the process by implementing parallelism, with good results when compared with vendor tools. However, these improvements are not easily integrated into proprietary workflows and require a high level of expertise for effective usage. Likewise, existing PARs targeting clusters are platform-specific, and this will not change until a standard way of describing a heterogeneous system is adopted.

In the final phase, execution, developers must be able to verify and analyze the performance of the implementation. Critical runtime services must be included, such as task management, checkpoints, heartbeats, and debugging services. The effective implementation of such services depends on their consideration in previous design phases. The works studied in detail in [158] provided definitions of abstraction layers for user interaction and management, showing great improvement in the execution phase. Likewise, several FPGA operating systems have been developed [159], [160], [161], implementing abstractions such as threads. Nevertheless, some challenges remain, notably:
• Translation tools capable of targeting scalable heterogeneous platforms
• High-level prediction tools for performance, energy consumption, and resource utilization, among others
• Universal debugging and verification tools for distributed reconfigurable computing

Even if there are platform-specific solutions to some of the previously mentioned challenges, the real challenge is to develop standard and generic solutions suitable for any heterogeneous cluster implementation in a community-driven development approach that would greatly accelerate adoption and growth, as shown in [162], [163], and [164]. In high-level synthesis, novel frameworks provide a convenient approach by including off-chip synchronization and communication APIs, such as Auto-Pipe [165] in 2010 and, more recently, OpenFC [166] and SMAPPIC [167].

IV. CONCLUSION
Supercomputers have been growing in recent years to occupy large areas and consume as much energy as small towns. This trend is impossible to sustain and highlights the major issues of the current approach based on CPUs and GPUs. Meanwhile, FPGA-based heterogeneous platforms have shown great improvements in performance and energy consumption when compared to their CPU or GPU counterparts. Nonetheless, adoption has remained low, primarily owing to the complexity of hardware design and the lack of standards for interconnection, structure, and program description, to mention some of the issues that affect most development tools by forcing over-specification.

By studying the most relevant implementations of FPGA heterogeneous clusters, we propose three main domains in each cluster, namely network, hardware, and software tools, that help recognize the contributions and challenges of each work. Furthermore, studying a specific cluster architecture under this division aids in identifying the origin of some issues and understanding the compromises of design decisions taken in the different domains. By understanding the trade-offs related to each decision, developers can better anticipate the critical issues in each domain and plan contingency measures in the most convenient manner. This survey sheds light on the open challenges that future clusters will have to overcome, but it also offers an overview of the already available and tested approaches.

FPGA-based heterogeneous computing is a challenging field, with enormous potential to change the dominant computing paradigm. In recent years, great interest has brought important contributions to the development of tools and, more importantly, experimental platforms.
With standard platform descriptions and interfaces, an open collaborative development approach will allow the creation of communities to accelerate adoption. New technologies, such as SoC-FPGAs, will certainly be at the center of future cluster architectures, considering the advantages of having CPUs, GPUs, and FPGAs in the same device.

ACKNOWLEDGMENT
The authors would like to thank Romina Soledad Molina and Charn Loong Ng for their valuable insight in the process of writing this article.

REFERENCES
[20] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, ‘‘Rodinia: A benchmark suite for heterogeneous computing,’’ in Proc. IEEE Int. Symp. Workload Characterization (IISWC), Oct. 2009, pp. 44–54, doi: 10.1109/IISWC.2009.5306797.
[21] Virginia Tech Synergy. (2019). GitHub—VTSynergy/OpenDwarfs: A Benchmark Suite. [Online]. Available: https://github.com/vtsynergy/OpenDwarfs
[22] A. Putnam et al., ‘‘A reconfigurable fabric for accelerating large-scale datacenter services,’’ in Proc. ACM/IEEE 41st Int. Symp. Comput. Archit. (ISCA), Jun. 2014, pp. 13–24, doi: 10.1109/ISCA.2014.6853195.
[23] Alibaba. (2018). Deep Dive Into Alibaba Cloud F3 FPGA as a Service Instances—Alibaba Cloud Community. [Online]. Available: https://www.alibabacloud.com/blog/deep-dive-into-alibaba-cloud-f3-fpga-as-a-service-instances_594057
[24] Amazon. (2017). Amazon EC2 F1 Instances. [Online]. Available: https://aws.amazon.com/ec2/instance-types/f1/
[1] C. Maxfield. (Sep. 2011). Who Made the First PLD?—EETimes. [Online]. [25] R. Kiełbik, K. Hałagan, W. Zatorski, J. Jung, J. Ulański, A. Napieralski,
Available: https://www.eetimes.com/who-made-the-first-pld/ K. Rudnicki, P. Amrozik, G. Jabłoński, D. Stożek, P. Polanowski,
[2] (2017). Xilinx Co-Founder Ross Freeman Honored—EETimes. [Online]. Z. Mudza, J. Kupis, and P. Panek, ‘‘ARUZ—Large-scale, massively
Available: https://www.eetimes.com/xilinx-co-founder-ross-freeman- parallel FPGA-based analyzer of real complex systems,’’ Comput.
honored/ Phys. Commun., vol. 232, pp. 22–34, Nov. 2018. [Online]. Available:
[3] Xilinx. (2021). Vivado Design Suite User Guide, Version 2021.1. https://linkinghub.elsevier.com/retrieve/pii/S0010465518302182, doi:
[Online]. Available: https://www.xilinx.com/support/documentation/sw_ 10.1016/j.cpc.2018.06.010.
manuals/xilinx2021_1/ug973-vivado-release-notes-install-license.pdf [26] F. Fahim et al., ‘‘hls4ml: An open-source codesign workflow to
[4] Xilinx. (2021). Vitis Unified Software Platform User Guide, Version empower scientific low-power machine learning devices,’’ 2021,
2021.1. [Online]. Available: https://www.xilinx.com/support/document arXiv:2103.05579.
ation/sw_manuals/xilinx2021_1/ug1416-vitis-unified-platform.pdf [27] J. Villarreal, A. Park, W. Najjar, and R. Halstead, ‘‘Designing modu-
[5] Intel Corporation. (2021). Quartus Prime User Guide, Version lar hardware accelerators in C with ROCCC 2.0,’’ in Proc. 18th IEEE
21.1. [Online]. Available: https://www.intel.com/content/dam/www/ Annu. Int. Symp. Field-Program. Custom Comput. Mach., May 2010,
programmable/us/en/pdfs/literature/ug/ug-qps.pdf pp. 127–134, doi: 10.1109/FCCM.2010.28.
[6] Microsemi Libero. (2021). Libero SoC Design Suite User Guide, Version [28] R. Nane, V. Sima, B. Olivier, R. Meeuws, Y. Yankova, and K. Bertels,
12.0, 2021. [Online]. Available: https://www.microsemi.com/document- ‘‘DWARV 2.0: A CoSy-based C-to-VHDL hardware compiler,’’ in
portal/doc_view/131953-libero-soc-design-suite-v12-0-user-guide Proc. 22nd Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2012,
[7] (2021). Yosys Open SYnthesis Suite. Accessed: May 9, 2023. [Online]. pp. 619–622, doi: 10.1109/FPL.2012.6339221.
Available: https://github.com/YosysHQ/yosys [29] A. Papakonstantinou, K. Gururaj, J. A. Stratton, D. Chen, J. Cong,
and W.-M.-W. Hwu, ‘‘Efficient compilation of CUDA kernels for
[8] CHIPS Alliance. (2017). FOSS Flows for FPGA—F4PGA
high-performance computing on FPGAs,’’ ACM Trans. Embedded Com-
Documentation. [Online]. Available: https://f4pga.readthedocs.
put. Syst., vol. 13, no. 2, pp. 1–26, Sep. 2013, doi: 10.1145/2514641.
io/en/latest/index.html
2514652.
[9] Agile Analog. (2021). RapidSilicon: Accelerating Silicon Development.
[30] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski,
Accessed: May 9, 2023. [Online]. Available: https://www.agileanalog.
S. D. Brown, and J. H. Anderson, ‘‘LegUp: An open-source high-level
com/products/rapidsilicon
synthesis tool for FPGA-based processor/accelerator systems,’’ ACM
[10] W. A. Najjar and P. Ienne, ‘‘Reconfigurable computing,’’ IEEE Micro, Trans. Embedded Comput. Syst., vol. 13, no. 2, pp. 1–27, Sep. 2013.
vol. 34, no. 1, pp. 4–6, Jan. 2014. [Online]. Available: https://dl.acm.org/ [Online]. Available: https://doi-org.ezproxy.cern.ch/10.1145/2514740,
doi/10.1145/508352.508353, doi: 10.1109/MM.2014.25. doi: 10.1145/2514740.
[11] Altera Corporation. (Jul. 2014). What is an SoC FPGA? Architecture [31] S. Lee, J. Kim, and J. S. Vetter, ‘‘OpenACC to FPGA: A frame-
Brief. [Online]. Available: http://www.altera.com/socarchitecture work for directive-based high-performance reconfigurable computing,’’
[12] W. Vanderbauwhede et al., High-Performance Computing Using FPGAs. in Proc. IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), May 2016,
New York, NY, USA: Springer, 2013. pp. 544–554, doi: 10.1109/IPDPS.2016.28.
[13] M. Awad, ‘‘FPGA supercomputing platforms: A survey,’’ in Proc. [32] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee,
Int. Conf. Field Program. Log. Appl., Aug. 2009, pp. 564–568. J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and
[Online]. Available: http://ieeexplore.ieee.org/document/5272406/, doi: A. Agarwal, ‘‘Baring it all to software: Raw machines,’’ Computer,
10.1109/FPL.2009.5272406. vol. 30, no. 9, pp. 86–93, 1997. [Online]. Available: http://ieeexplore.
[14] K. O’Neal and P. Brisk, ‘‘Predictive modeling for CPU, GPU, and ieee.org/document/612254/, doi: 10.1109/2.612254.
FPGA performance and power consumption: A survey,’’ in Proc. IEEE [33] J. D. Davis, ‘‘FAST: A flexible architecture for simulation and testing
Comput. Soc. Annu. Symp. VLSI (ISVLSI), Jul. 2018, pp. 763–768, doi: of multiprocessor and CMP systems,’’ Dept. Elect. Eng., Stanford Univ.,
10.1109/ISVLSI.2018.00143. Stanford, CA, USA, Dec. 2006.
[15] P. Colella. (2004). Defining Software Requirements for Scientific [34] H. Kalte, M. Porrmann, and U. Rückert, ‘‘A prototyping platform for
Computing. DARPA HPCS. [Online]. Available: https://www.krellinst. dynamically reconfigurable system on chip designs,’’ in Proc. IEEE
org/doecsgf/conf/2013/pres/pcolella.pdf Workshop Heterogeneous Reconfigurable Syst. Chip (SoC), Hamburg,
[16] K. Asanovic et al., ‘‘The landscape of parallel computing research: Germany, Apr. 2002, pp. 57–75.
A view from Berkeley,’’ 2006. [Online]. Available: http://www.eecs. [35] M. Porrmann et al., ‘‘RAPTOR—A scalable platform for rapid prototyp-
berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html ing and FPGA-based cluster computing,’’ in Parallel Computing: From
[17] R. Inta, D. J. Bowman, and S. M. Scott, ‘‘The ‘Chimera’: An off-the-shelf Multicores and GPU’s to Petascale (Advances in Parallel Computing),
CPU/GPGPU/FPGA hybrid computing platform,’’ Int. J. Reconfigurable vol. 19. Amsterdam, The Netherlands: IOS Press, 2010, doi: 10.3233/978-
Comput., vol. 2012, pp. 1–10, Jan. 2012. [Online]. Available: http://www. 1-60750-530-3-592.
hindawi.com/journals/ijrc/2012/241439, doi: 10.1155/2012/241439. [36] C. Steffen and G. Genest, ‘‘Nallatech in-socket FPGA front-side bus
[18] R. D. Chamberlain, ‘‘Architecturally truly diverse systems: A accelerator,’’ Comput. Sci. Eng., vol. 12, no. 2, pp. 78–83, Mar. 2010, doi:
review,’’ Future Gener. Comput. Syst., vol. 110, pp. 33–44, 10.1109/MCSE.2010.45.
Sep. 2020. [Online]. Available: https://linkinghub.elsevier.com/retrieve/ [37] C. Pohl, C. Paiz, and M. Porrmann, ‘‘vMAGIC—Automatic code
pii/S0167739X19313184, doi: 10.1016/j.future.2020.03.061. generation for VHDL,’’ Int. J. Reconfigurable Comput., vol. 2009,
[19] R. Palmer. (2011). Parallel Dwarfs (Inaccessible). [Online]. Available: pp. 1–9, Jan. 2009. [Online]. Available: http://vmagic.sourceforge.net/,
http://paralleldwarfs.codeplex.com/ doi: 10.1155/2009/205149.


[38] S. Lyberis, G. Kalokerinos, M. Lygerakis, V. Papaefstathiou, [55] W. Kastl and T. Loimayr, ‘‘A parallel computing system with special-
I. Mavroidis, M. Katevenis, D. Pnevmatikatos, and D. S. Nikolopoulos, ized coprocessors for cryptanalytic algorithms,’’ in P170—Sicherheit
‘‘FPGA prototyping of emerging manycore architectures for parallel 2010—Sicherheit, Schutz und Zuverlässigkeit, F. C. Freiling, Ed. Bonn,
programming research using formic boards,’’ J. Syst. Archit., vol. 60, Germany: Gesellschaft für Informatik, 2010, pp. 78–83. [Online]. Avail-
no. 6, pp. 481–493, Jun. 2014. [Online]. Available: https://linkinghub. able: https://dl.gi.de/handle/20.500.12116/19801
elsevier.com/retrieve/pii/S138376211400054X, doi: 10.1016/j.sysarc. [56] B. Danczul, J. Fuß, S. Gradinger, B. Greslehner, W. Kastl, and F. Wex,
2014.03.002. ‘‘Cuteforce analyzer: A distributed bruteforce attack on PDF encryption
[39] S. Lyberis, G. Kalokerinos, M. Lygerakis, V. Papaefstathiou, with GPUs and FPGAs,’’ in Proc. Int. Conf. Availability, Rel. Secur.,
D. Tsaliagkos, M. Katevenis, D. Pnevmatikatos, and D. Nikolopoulos, Sep. 2013, pp. 720–725. [Online]. Available: http://ieeexplore.ieee.
‘‘Formic: Cost-efficient and scalable prototyping of manycore org/document/6657310/, doi: 10.1109/ARES.2013.94.
architectures,’’ in Proc. IEEE 20th Int. Symp. Field-Program. Custom [57] A. H. T. Tse, D. B. Thomas, K. H. Tsoi, and W. Luk,
Comput. Mach., Apr. 2012, pp. 61–64, doi: 10.1109/FCCM.2012.20. ‘‘Dynamic scheduling monte-carlo framework for multi-accelerator
[40] H. Shah et al., ‘‘Remote direct memory access (RDMA) protocol exten- heterogeneous clusters,’’ in Proc. Int. Conf. Field-Program. Technol.,
sions,’’ Tech. Rep. 7306, Jun. 2014. Dec. 2010, pp. 233–240. [Online]. Available: http://ieeexplore.ieee.
[41] V. Kale, ‘‘Using the MicroBlaze processor to accelerate cost-sensitive org/document/5681495/, doi: 10.1109/FPT.2010.5681495.
embedded system development,’’ Xilinx, Jun. 2016. [Online]. Available: [58] G. Tan, C. Zhang, W. Wang, and P. Zhang, ‘‘SuperDragon,’’ ACM
https://docs.xilinx.com/v/u/en-US/wp469-microblaze-for-cost-sensitive- Trans. Reconfigurable Technol. Syst., vol. 8, no. 4, pp. 1–22, Oct. 2015.
apps [Online]. Available: https://dl.acm.org/doi/10.1145/2740966, doi:
[42] S. G. Kavadias, M. G. H. Katevenis, M. Zampetakis, and 10.1145/2740966.
D. S. Nikolopoulos, ‘‘On-chip communication and synchronization [59] S. W. Moore, P. J. Fox, S. J. T. Marsh, A. T. Markettos, and A. Mujumdar,
mechanisms with cache-integrated network interfaces,’’ in Proc. 7th ‘‘Bluehive–A field-programable custom computing machine for extreme-
ACM Int. Conf. Comput. Frontiers, May 2010, pp. 217–226, doi: scale real-time neural network simulation,’’ in Proc. IEEE 20th Int.
10.1145/1787275.1787328. Symp. Field-Program. Custom Comput. Mach., Apr. 2012, pp. 133–140.
[43] Cadence. (2019). Palladium Emulation | Cadence. [Online]. Available: [Online]. Available: https://ieeexplore.ieee.org/document/6239804/, doi:
https://www.cadence.com/en_US/home/tools/system-design-and- 10.1109/FCCM.2012.32.
verification/emulation-and-prototyping/palladium.html [60] P. J. Fox, A. T. Markettos, and S. W. Moore, ‘‘Reliably prototyping
[44] Siemens. (2022). Veloce Prototyping—FPGA | Siemens Software. large SoCs using FPGA clusters,’’ in Proc. 9th Int. Symp. Reconfig-
[Online]. Available: https://eda.sw.siemens.com/en-US/ic/veloce/fpga- urable Commun.-Centric Syst.-on-Chip (ReCoSoC), May 2014, pp. 1–8.
prototyping/ [Online]. Available: http://ieeexplore.ieee.org/document/6861350/, doi:
[45] B. da Silva, A. Braeken, E. H. D’Hollander, A. Touhafi, J. G. Cornelis, 10.1109/ReCoSoC.2014.6861350.
and J. Lemeire, ‘‘Comparing and combining GPU and FPGA accelerators [61] A. Theodore Markettos, P. J. Fox, S. W. Moore, and A. W. Moore,
in an image processing context,’’ in Proc. 23rd Int. Conf. Field Program. ‘‘Interconnect for commodity FPGA clusters: Standardized or cus-
Log. Appl., Sep. 2013, pp. 1–4. [Online]. Available: http://ieeexplore. tomized?’’ in Proc. 24th Int. Conf. Field Program. Log. Appl. (FPL),
ieee.org/document/6645552/, doi: 10.1109/FPL.2013.6645552. Sep. 2014, pp. 1–8. http://ieeexplore.ieee.org/document/6927472/, doi:
[46] T. Otsuka, T. Aoki, E. Hosoya, and A. Onozawa, ‘‘An image recognition 10.1109/FPL.2014.6927472.
system for multiple video inputs over a multi-FPGA system,’’ in Proc. [62] R. S. Nikhil et al., BSV by Example, 10th ed. 2010. [Online]. Available:
IEEE 6th Int. Symp. Embedded Multicore SoCs, Sep. 2012, pp. 1–7. http://www.bluespec.com/support/
[Online]. Available: http://ieeexplore.ieee.org/document/6354671/, doi: [63] M. Baity-Jesi et al., ‘‘Janus II: A new generation application-driven com-
10.1109/MCSoC.2012.33. puter for spin-system simulations,’’ Comput. Phys. Commun., vol. 185,
[47] The RTN Collaboration, ‘‘64-transputer machine,’’ in Proc. CHEP, no. 2, pp. 550–559, Feb. 2014. [Online]. Available: https://linkinghub.
Geneva, Switzerland, 1992, pp. 353–360. elsevier.com/retrieve/pii/S0010465513003470, doi: 10.1016/j.cpc.2013.
[48] H. Schmit et al., ‘‘Behavioral synthesis for FPGA-based comput- 10.019.
ing,’’ in Proc. IEEE Workshop FPGA’s Custom Comput. Mach., 1994, [64] R. Kiełbik, K. Rudnicki, Z. Mudza, and J. Jung, ‘‘Methodology of
pp. 125–132, doi: 10.1109/FPGA.1994.315591. firmware development for ARUZ—An FPGA-based HPC system,’’ Elec-
[49] A. Cruz, J. Pech, A. Tarancón, P. Téllez, C. L. Ullod, and C. Ungil, tronics, vol. 9, no. 9, p. 1482, Sep. 2020. [Online]. Available: https://
‘‘SUE: A special purpose computer for spin glass models,’’ Com- www.mdpi.com/journal/electronics, doi: 10.3390/electronics9091482.
put. Phys. Commun., vol. 133, nos. 2–3, pp. 165–176, Jan. 2001, doi: [65] (2006). VHDL Preprocessor Home Page. [Online]. Available: https://
10.1016/S0010-4655(00)00170-3. sourceforge.net/projects/vhdlpp/
[50] F. Belletti, I. Campos, A. Maiorano, S. P. Gavir, D. Sciretti, A. Tarancon, [66] S. Karim, J. Harkin, L. McDaid, B. Gardiner, and J. Liu, ‘‘AstroByte:
J. L. Velasco, A. C. Flor, D. Navarro, P. Tellez, L. A. Fernandez, Multi-FPGA architecture for accelerated simulations of spiking astrocyte
V. Martin-Mayor, A. M. Sudupe, S. Jimenez, E. Marinari, F. Mantovani, neural networks,’’ in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE),
G. Poll, S. F. Schifano, L. Tripiccione, and J. J. Ruiz-Lorenzo, ‘‘Ianus: An Mar. 2020, pp. 1568–1573, doi: 10.23919/DATE48585.2020.9116312.
adaptive FPGA computer,’’ Comput. Sci. Eng., vol. 8, no. 1, pp. 41–49, [67] S. Yang, J. Wang, X. Hao, H. Li, X. Wei, B. Deng, and K. A. Loparo,
Jan. 2006. [Online]. Available: http://ieeexplore.ieee.org/document/ ‘‘BiCoSS: Toward large-scale cognition brain with multigranular neuro-
1563961/, doi: 10.1109/MCSE.2006.9. morphic architecture,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 33,
[51] F. Belletti et al., ‘‘Janus: An FPGA-based system for high- no. 7, pp. 2801–2815, Jul. 2022, doi: 10.1109/TNNLS.2020.3045492.
performance scientific computing,’’ Comput. Sci. Eng., vol. 11, no. 1, [68] D. Gratadour. (2021). Microgate—Green Flash. [Online]. Available:
pp. 48–58, Jan. 2009. [Online]. Available: https://ieeexplore.ieee.org/ http://green-flash.lesia.obspm.fr/microgate.html
document/4720223/, doi: 10.1109/MCSE.2009.11. [69] Y. Clénet et al. (2019). MICADO-MAORY SCAO Preliminary Design,
[52] M. Baity-Jesi et al., ‘‘An FPGA-based supercomputer for statistical Development Plan & Calibration Strategies. [Online]. Available:
physics: The weird case of Janus,’’ in High-Performance Computing https://hal.archives-ouvertes.fr/hal-03078430
Using FPGAs. New York, NY, USA: Springer, Mar. 2013, pp. 481–506. [70] A. Brown, D. Thomas, J. Reeve, G. Tarawneh, A. De Gennaro,
[Online]. Available: https://link-springer-com.ezproxy.cern.ch/chapter/ A. Mokhov, M. Naylor, and T. Kazmierski, ‘‘Distributed event-based
10.1007/978-1-4614-1791-0_16, doi: 10.1007/978-1-4614-1791-0_16. computing,’’ in Parallel Computing is Everywhere (Advances in Parallel
[53] S. Kumar, C. Paar, J. Pelzl, G. Pfeiffer, and M. Schimmler, ‘‘Breaking Computing), vol. 32. 2018, pp. 583–592. [Online]. Available: https://
ciphers with COPACOBANA—A cost-optimized parallel code breaker,’’ ebooks.iospress.nl/doi/10.3233/978-1-61499-843-3-583, doi: 10.3233/
in Proc. Int. Workshop Cryptograph. Hardw. Embedded Syst., in Lecture 978-1-61499-843-3-583.
Notes in Computer Science: Including Subseries Lecture Notes in Arti- [71] M. A. Petrovici, B. Vogginger, P. Müller, O. Breitwieser, M. Lundqvist,
ficial Intelligence and Lecture Notes in Bioinformatics, vol. 4249, 2006, L. Müller, M. Ehrlich, A. Destexhe, A. Lansner, R. Schüffny,
pp. 101–118, doi: 10.1007/11894063_9. J. Schemmel, and K. Meier, ‘‘Characterization and compensation of
[54] T. Güneysu, T. Kasper, M. Novotný, C. Paar, and A. Rupp, ‘‘Crypt- network-level anomalies in mixed-signal neuromorphic modeling plat-
analysis with COPACOBANA,’’ IEEE Trans. Comput., vol. 57, no. 11, forms,’’ PLoS ONE, vol. 9, no. 10, Oct. 2014, Art. no. e108590. [Online].
pp. 1498–1513, Nov. 2008. [Online]. Available: http://ieeexplore.ieee. Available: https://journals.plos.org/plosone/article?id=10.1371/journal.
org/document/4515858/, doi: 10.1109/TC.2008.80. pone.0108590, doi: 10.1371/journal.pone.0108590.


[72] I. Ohmura, G. Morimoto, Y. Ohno, A. Hasegawa, and M. Taiji, [88] R. Baxter, S. Booth, M. Bull, G. Cawood, J. Perry, M. Parsons,
‘‘MDGRAPE-4: A special-purpose computer system for molecular A. Simpson, A. Trew, A. McCormick, G. Smart, R. Smart, A. Cantle,
dynamics simulations,’’ Philos. Trans. Roy. Soc. A, Math., Phys. Eng. Sci., R. Chamberlain, and G. Genest, ‘‘Maxwell—A 64 FPGA supercom-
vol. 372, Aug. 2014, Art. no. 20130387. [Online]. Available: https://pmc/ puter,’’ in Proc. 2nd NASA/ESA Conf. Adapt. Hardw. Syst. (AHS),
articles/PMC4084528/ and https://pmc/articles/PMC4084528/?report= Aug. 2007, pp. 287–294. http://ieeexplore.ieee.org/document/4291933/,
abstract and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4084528/, doi: 10.1109/AHS.2007.71.
doi: 10.1098/RSTA.2013.0387. [89] R. Baxter, S. Booth, M. Bull, G. Cawood, J. Perry, M. Parsons,
[73] J. Weerasinghe, F. Abel, C. Hagleitner, and A. Herkersdorf, ‘‘Enabling A. Simpson, A. Trew, A. McCormick, G. Smart, R. Smart, A. Cantle,
FPGAs in hyperscale data centers,’’ in Proc. IEEE 12th Int. Conf. Ubiq- R. Chamberlain, and G. Genest, ‘‘The FPGA high-performance comput-
uitous Intell. Comput., IEEE 12th Int. Conf. Autonomic Trusted Comput. ing alliance parallel toolkit,’’ in Proc. 2nd NASA/ESA Conf. Adapt. Hardw.
IEEE 15th Int. Conf. Scalable Comput. Commun. Associated Workshops Syst. (AHS), Aug. 2007, pp. 301–307, doi: 10.1109/AHS.2007.104.
(UIC-ATC-ScalCom), Aug. 2015, pp. 1078–1086. [Online]. Available: [90] O. Mencer, K. H. Tsoi, S. Craimer, T. Todman, W. Luk, M. Y. Wong,
https://ieeexplore.ieee.org/document/7518378/, doi: 10.1109/UIC-ATC- and P. H. W. Leong, ‘‘Cube: A 512-FPGA cluster,’’ in Proc.
ScalCom-CBDCom-IoP.2015.199. 5th Southern Conf. Program. Log. (SPL), Apr. 2009, pp. 51–57.
[74] Xilinx. (2016). Xilinx and IBM to Enable FPGA-Based Acceleration [Online]. Available: http://ieeexplore.ieee.org/document/4914907/, doi:
Within SuperVessel OpenPOWER Development Cloud. [Online]. 10.1109/SPL.2009.4914907.
Available: https://www.xilinx.com/news/press/2016/xilinx-and-ibm-to- [91] M. Showerman, J. Enos, A. Pant, V. Kindratenko, C. Steffen,
enable-fpga-based-acceleration-within-supervessel-openpower- R. Pennington, and W.-M. Hwu, ‘‘QP: A heterogeneous multi-accelerator
development-cloud.html cluster,’’ in Proc. 10th LCI Int. Conf. High-Perform. Clustered Comput.,
[75] F. Abel, J. Weerasinghe, C. Hagleitner, B. Weiss, and S. Paredes, Boulder, CO, USA, Mar. 2009, pp. 1–8.
‘‘An FPGA platform for hyperscalers,’’ in Proc. IEEE 25th Annu. [92] Xilinx. (2013). ISE Design Suite. [Online]. Available:
Symp. High-Perform. Interconnects (HOTI), Aug. 2017, pp. 29–32. https://www.xilinx.com/products/design-tools/ise-design-suite.html
[Online]. Available: http://ieeexplore.ieee.org/document/8071053/, doi: [93] A. Pant, H. Jafri, and V. Kindratenko, ‘‘Phoenix: A runtime environment
10.1109/HOTI.2017.13. for high performance computing on chip multiprocessors,’’ in Proc.
[76] J. Weerasinghe, F. Abel, C. Hagleitner, and A. Herkersdorf, ‘‘Disag- 17th Euromicro Int. Conf. Parallel, Distrib. Netw.-Based Process., 2009,
gregated FPGAs: Network performance comparison against bare-metal pp. 119–126, doi: 10.1109/PDP.2009.41.
servers, virtual machines and Linux containers,’’ in Proc. Int. Conf.
[94] K. H. Tsoi and W. Luk, ‘‘Axel,’’ in Proc. 18th Annu. ACM/SIGDA Int.
Cloud Comput. Technol. Sci. (CloudCom), Dec. 2016, pp. 9–17, doi:
Symp. Field Program. Gate Arrays, New York, NY, USA, Feb. 2010,
10.1109/CLOUDCOM.2016.0018.
p. 115. http://portal.acm.org/citation.cfm?doid=1723112.1723134, doi:
[77] B. Ringlein, F. Abel, A. Ditter, B. Weiss, C. Hagleitner, and D. Fey, ‘‘Pro- 10.1145/1723112.1723134.
gramming reconfigurable heterogeneous computing clusters using MPI
[95] Adaptive Computing Enterprises. (2015). TORQUE Resource Man-
with transpilation,’’ in Proc. IEEE/ACM Int. Workshop Heterogeneous
ager Administrator Guide 4.2.10. [Online]. Available: http://www.
High-Perform. Reconfigurable Comput. (H2RC), Nov. 2020, pp. 1–9, doi:
adaptivecomputing.com
10.1109/H2RC51942.2020.00006.
[96] (2014). Maui Scheduler Administrator’s Guide. [Online]. Available:
[78] B. Ringlein, F. Abel, A. Ditter, B. Weiss, C. Hagleitner, and D. Fey,
http://docs.adaptivecomputing.com/maui/
‘‘ZRLMPI: A unified programming model for reconfigurable hetero-
geneous computing clusters,’’ in Proc. IEEE 28th Annu. Int. Symp. [97] A. George, H. Lam, and G. Stitt, ‘‘Novo-G: At the forefront of scal-
Field-Program. Custom Comput. Mach. (FCCM), May 2020, p. 220, doi: able reconfigurable supercomputing,’’ Comput. Sci. Eng., vol. 13, no. 1,
10.1109/FCCM48280.2020.00051. pp. 82–86, Jan. 2011. [Online]. Available: http://ieeexplore.ieee.org/
[79] H. Shahzad, A. Sanaullah, and M. Herbordt, ‘‘Survey and future document/5678570/, doi: 10.1109/MCSE.2011.11.
trends for FPGA cloud architectures,’’ in Proc. IEEE High Per- [98] A. D. George, M. C. Herbordt, H. Lam, A. G. Lawande, J. Sheng,
form. Extreme Comput. Conf. (HPEC), Sep. 2021, pp. 1–10, doi: and C. Yang, ‘‘Novo-G#: Large-scale reconfigurable computing with
10.1109/HPEC49654.2021.9622807. direct and programmable interconnects,’’ in Proc. IEEE High Per-
[80] C. Bobda et al., ‘‘The future of FPGA acceleration in datacenters and form. Extreme Comput. Conf. (HPEC), Sep. 2016, pp. 1–7. [Online].
the cloud,’’ ACM Trans. Reconfigurable Technol. Syst., vol. 15, no. 3, Available: http://ieeexplore.ieee.org/document/7761639/, doi: 10.1109/
Sep. 2022, Art. no. 34, doi: 10.1145/3506713. HPEC.2016.7761639.
[81] R. Sass, W. V. Kritikos, A. G. Schmidt, S. Beeravolu, and [99] Xilinx. (Oct. 2017). Interlaken 150G. [Online]. Available:
P. Beeraka, ‘‘Reconfigurable computing cluster (RCC) project: https://docs.xilinx.com/v/u/en-US/pg212-interlaken-150g
Investigating the feasibility of FPGA-based petascale computing,’’ [100] R. Giorgi, ‘‘AXIOM: A 64-bit reconfigurable hardware/software plat-
in Proc. 15th Annu. IEEE Symp. Field-Program. Custom Comput. form for scalable embedded computing,’’ in Proc. 6th Medit. Conf.
Mach. (FCCM), Apr. 2007, pp. 127–140. [Online]. Available: Embedded Comput. (MECO), Jun. 2017, pp. 1–4. http://ieeexplore.ieee.
https://ieeexplore.ieee.org/document/4297250, doi: 10.1109/FCCM. org/document/7977173/, doi: 10.1109/MECO.2017.7977117.
2007.62. [101] C. Álvarez et al., ‘‘The AXIOM software layers,’’ Microprocessors
[82] A. G. Schmidt, W. V. Kritikos, S. Datta, and R. Sass, ‘‘Reconfigurable Microsyst., vol. 47, pp. 262–277, Nov. 2016, doi: 10.1016/J.MICPRO.
computing cluster project: Phase I brief,’’ in Proc. 16th Int. Symp. 2016.07.002.
Field-Program. Custom Comput. Mach., Apr. 2008, pp. 300–301, doi: [102] R. Giorgi, M. Procaccini, and F. Khalili, ‘‘AXIOM: A scalable,
10.1109/FCCM.2008.12. efficient and reconfigurable embedded platform,’’ in Proc. Design,
[83] AMD Xilinx. (Oct. 2022). Aurora 64B/66B LogiCORE IP Prod- Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2019, pp. 480–485, doi:
uct Guide. [Online]. Available: https://docs.xilinx.com/r/en-US/pg074- 10.23919/DATE.2019.8715168.
aurora-64b66b [103] A. Filgueras, M. Vidal, M. Mateu, D. Jiménez-González, C. Alvarez,
[84] R. G. Jaganathan, K. D. Underwood, and R. Sass, ‘‘A configurable X. Martorell, E. Ayguadé, D. Theodoropoulos, D. Pnevmatikatos, P. Gai,
network protocol for cluster based communications using modular S. Garzarella, D. Oro, J. Hernando, N. Bettin, A. Pomella, M. Procaccini,
hardware primitives on an intelligent NIC,’’ in Proc. ACM/IEEE and R. Giorgi, ‘‘The AXIOM project: IoT on heterogeneous embedded
Conf. Supercomput., Nov. 2003, p. 22, doi: 10.1145/1048935. platforms,’’ IEEE Design Test, vol. 38, no. 5, pp. 74–81, Oct. 2021, doi:
1050173. 10.1109/MDAT.2019.2952335.
[85] HPC Open. (2022). Open MPI: Open Source High Performance Comput- [104] AMD-Xilinx. (2021). Xilinx Adaptive Compute Clusters (XACC)
ing. [Online]. Available: https://www.open-mpi.org/ Academia-Industry Research Ecosystem | HACC Resources. [Online].
[86] K. Datta and R. Sass, ‘‘RBoot: Software infrastructure for a remote Available: https://www.amd-haccs.io/adapt_2021.html
FPGA laboratory,’’ in Proc. 15th Annu. IEEE Symp. Field-Program. [105] (2016). Heterogeneous Accelerated Compute Clusters | HACC
Custom Comput. Mach. (FCCM ), Apr. 2007, pp. 343–344, doi: Resources. [Online]. Available: https://www.amd-haccs.io/index.html
10.1109/FCCM.2007.53. [106] T. Prickett. (2018). Forging a Hybrid CPU-FPGA Supercomputer.
[87] Staff. (Jul. 2005). FPGA High-Performance Computing Alliance [Online]. Available: https://www.nextplatform.com/2018/09/25/forging-
(FHPCA). [Online]. Available: http://www.fhpca.org a-hybrid-cpu-fpga-supercomputer/


[107] Paderborn Center for Parallel Computing (PC2). (2022). PC2— [124] D. A. Patterson, ‘‘RAMP: Research accelerator for multiple
Noctua 2 (Universität Paderborn). [Online]. Available: https://pc2.uni- processors—A community vision for a shared experimental
paderborn.de/hpc-services/available-systems/noctua2 parallel HW/SW platform,’’ in Proc. IEEE Int. Symp. Perform.
[108] Intel. (2022). OneAPI: A New Era of Heterogeneous Computing. Anal. Syst. Softw., Mar. 2006, p. 1, doi: 10.1109/ISPASS.2006.
[Online]. Available: https://www.intel.com/content/www/us/en/ 1620784.
developer/tools/oneapi/overview.html [125] Wirbel Loring. (May 2010). Berkeley Emulation Engine Update—EDN.
[109] D. Cock, A. Ramdas, D. Schwyn, M. Giardino, A. Turowski, Z. He, [Online]. Available: https://www.edn.com/berkeley-emulation-engine-
N. Hossle, D. Korolija, M. Licciardello, K. Martsenko, R. Achermann, update/
G. Alonso, and T. Roscoe, ‘‘Enzian: An open, general, CPU/FPGA [126] J. Rothman and C. Chang, ‘‘BEE technology overview,’’ in Proc. Int.
platform for systems software research,’’ in Proc. 27th ACM Int. Conf. Conf. Embedded Comput. Syst. (SAMOS). Samos, Greece: Institute
Architectural Support Program. Lang. Operating Syst., Feb. 2022, p. 18, of Electrical and Electronics Engineers, Jan. 2013, p. 277, doi:
doi: 10.1145/3503222.3507742. 10.1109/SAMOS.2012.6404186.
[110] A. D. Ioannou, K. Georgopoulos, P. Malakonakis, D. N. Pnevmatikatos, [127] EDN. (Jun. 2010). DESIGN TOOLS—BEEcube Launches BEE4, a Full-
V. D. Papaefstathiou, I. Papaefstathiou, and I. Mavroidis, ‘‘UNILOGIC: Speed FPGA Prototyping Platform—EDN. [Online]. Available: https://
A novel architecture for highly parallel reconfigurable systems,’’ ACM www.edn.com/design-tools-beecube-launches-bee4-a-full-speed-fpga-
Trans. Reconfigurable Technol. Syst., vol. 13, no. 4, pp. 1–32, Dec. 2020, prototyping-platform/
doi: 10.1145/3409115. [128] M. Lin, ‘‘Hardware-assisted large-scale neuroevolution for multiagent
[111] Cygnus Consortium. (2018). About Cygnus. [Online]. Available: https:// learning,’’ Dept. Elect. Comput. Eng., Univ. Central Florida, Orlando,
www.ccs.tsukuba.ac.jp/wp-content/uploads/sites/14/2018/12/About- FL, USA, Dec. 2014. [Online]. Available: https://apps.dtic.mil/sti/
Cygnus.pdf citations/ADA621804
[112] T. Boku, N. Fujita, R. Kobayashi, and O. Tatebe, ‘‘Cygnus—World first [129] I. Sokol. (Apr. 2015). NIs BEEcube Acquisition Drives 5G Communi-
multihybrid accelerated cluster with GPU and FPGA coupling,’’ in Proc. cations | Microwaves & RF. [Online]. Available: https://www.mwrf.
ICPP Workshops. New York, NY, USA: Association for Computing com/technologies/systems/article/21846169/nis-beecube-acquisition-
Machinery, Aug. 2022, p. 1, doi: 10.1145/3547276.3548629. drives-5g-communications
[113] K. Kikuchi, N. Fujita, R. Kobayashi, and T. Boku, ‘‘Implementation [130] National Instruments. (2022). What is FlexRIO?—NI. [Online].
and performance evaluation of collective communications using CIRCUS Available: https://www.ni.com/it-it/shop/electronic-test-instrumentation/
on multiple FPGAs,’’ in Proc. HPC Asia Workshops. New York, NY, flexrio/what-is-flexrio.html
USA: Association for Computing Machinery, Feb. 2023, p. 1523, doi: [131] L. Bonati, P. Johari, M. Polese, S. D’Oro, S. Mohanti,
10.1145/3581576.3581602. M. Tehrani-Moayyed, D. Villa, S. Shrivastava, C. Tassie, K. Yoder,
[114] RIKEN Center for Computational Science. (2020). Fugaku: Riken’s A. Bagga, P. Patel, V. Petkov, M. Seltser, F. Restuccia, A. Gosain,
Flagship Supercomputer. [Online]. Available: https://www.fugaku- K. R. Chowdhury, S. Basagni, and T. Melodia, ‘‘Colosseum: Large-
riken.jp/ scale wireless experimentation through hardware-in-the-loop network
[115] K. Sano, A. Koshiba, T. Miyajima, and T. Ueno, ‘‘ESSPER: Elastic emulation,’’ in Proc. IEEE Int. Symp. Dyn. Spectr. Access Netw.
and scalable FPGA-cluster system for high-performance reconfigurable (DySPAN), Dec. 2021, pp. 105–113, doi: 10.1109/DYSPAN53946.2021.
computing with supercomputer Fugaku,’’ in Proc. Int. Conf. High 9677430.
Perform. Comput. Asia–Pacific Region (HPC Asia). New York, NY, [132] Ettus. (2014). USRP Hardware Driver and USRP Manual: USRP
USA: Association for Computing Machinery, 2023, pp. 140–150, doi: X3x0 Series. [Online]. Available: https://files.ettus.com/manual/page_
10.1145/3578178.3579341. usrp_x3x0.html
[116] J. Davis et al., ‘‘BEE3: Revitalizing computer architecture research,’’ [133] NI. (2022). ATCA Overview—NI. [Online]. Available: https://www.
Microsoft, Apr. 2009. [Online]. Available: https://www.microsoft.com/ ni.com/docs/en-US/bundle/atca-3671-getting-started/page/overview.
en-us/research/publication/bee3-revitalizing-computer-architecture- html
research/ [134] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber,
[117] K. Kuusilinna, C. Chang, M. J. Ammer, B. C. Richards, and ‘‘HyperX: Topology, routing, and packaging of efficient large-scale net-
R. W. Brodersen, ‘‘Designing BEE: A hardware emulation engine for works,’’ in Proc. Conf. High Perform. Comput. Netw., Storage Anal.
signal processing in low-power wireless applications,’’ EURASIP J. Adv. (SC), New York, NY, USA, 2009, p. 1. [Online]. Available: http://
Signal Process., vol. 2003, no. 6, pp. 502–513, Dec. 2003. [Online]. dl.acm.org/citation.cfm?doid=1654059.1654101, doi: 10.1145/1654059.
Available: https://www.mathworks.com 1654101.
[118] S. C. Jain, S. Kumar, and A. Kumar, ‘‘Evaluation of various rout- [135] S. Gupta et al. (2022). Getting Started With RFNoC in UHD 4.0—
ing architectures for multi-FPGA boards,’’ in Proc. VLSI Design Ettus Knowledge Base. [Online]. Available: https://kb.ettus.com/Getting_
Wireless Digit. Imag. Millennium 13th Int. Conf. VLSI Design. Started_with_RFNoC_in_UHD_4.0
Washington, DC, USA: IEEE Computer Society, 2000, pp. 262–267. [136] A. Chaudhari and M. Braun, ‘‘A scalable FPGA architecture for flexible,
[Online]. Available: http://ieeexplore.ieee.org/document/812619/, doi: large-scale, real-time RF channel emulation,’’ in Proc. 13th Int. Symp.
10.1109/ICVD.2000.812619. Reconfigurable Commun.-Centric Syst.-on-Chip (ReCoSoC), Jul. 2018,
[119] C. Chang, K. Kuusilinna, B. Richards, and R. W. Brodersen, ‘‘Imple- pp. 1–8. [Online]. Available: https://ieeexplore.ieee.org/document/
mentation of BEE: A real-time large-scale hardware emulation engine,’’ 8449390/, doi: 10.1109/ReCoSoC.2018.8449390.
in Proc. ACM/SIGDA 11th Int. Symp. Field Program. Gate Arrays, [137] J. J. Dongarra and A. J. van der Steen, ‘‘High-performance computing
Feb. 2003, pp. 91–99, doi: 10.1145/611817.611832. systems: Status and outlook,’’ Acta Numerica, vol. 21, pp. 379–474,
[120] C. Chang, J. Wawrzynek, and R. W. Brodersen, ‘‘BEE2: A high-end May 2012, doi: 10.1017/S0962492912000050.
reconfigurable computing system,’’ IEEE Design Test Comput., vol. 22, [138] L. M. Al Qassem, T. Stouraitis, E. Damiani, and I. M. Elfadel,
no. 2, pp. 114–125, Feb. 2005, doi: 10.1109/MDT.2005.30. ‘‘FPGAaaS: A survey of infrastructures and systems,’’ IEEE Trans. Ser-
[121] A. G. Schmidt, B. Huang, R. Sass, and M. French, ‘‘Check- vices Comput., vol. 15, no. 2, pp. 1143–1156, Mar. 2022, doi: 10.1109/
point/restart and beyond: Resilient high performance computing with TSC.2020.2976012.
FPGAs,’’ in Proc. IEEE 19th Annu. Int. Symp. Field-Program. Cus- [139] A. George, H. Lam, A. Lawande, C. Pascoe, and G. Stitt, ‘‘Novo-
tom Comput. Mach., May 2011, pp. 162–169, doi: 10.1109/FCCM. G: A view at the HPC crossroads for scientific computing,’’ in Proc.
2011.22. ERSA, 2010, pp. 21–30. [Online]. Available: http://plaza.ufl.edu/poppyc/
[122] S. Buscemi and R. Sass, ‘‘Design and utilization of an FPGA cluster ERS5029.pdf
to implement a digital wireless channel emulator,’’ in Proc. 22nd Int. [140] D. Gratadour et al., ‘‘Prototyping AO RTC using emerging high
Conf. Field Program. Log. Appl. (FPL), Aug. 2012, pp. 635–638, doi: performance computing technologies with the green flash project,’’ Proc.
10.1109/FPL.2012.6339253. SPIE, vol. 10703, pp. 404–418, Jul. 2018. [Online]. Available: https://
[123] S. Buscemi and R. Sass, ‘‘Design of a scalable digital wireless chan- www.spiedigitallibrary.org/conference-proceedings-of-spie/10703/1070
nel emulator for networking radios,’’ in Proc. Mil. Commun. Conf., 318/Prototyping-AO-RTC-using-emerging-high-performance-computin
Nov. 2011, pp. 1858–1863. [Online]. Available: http://ieeexplore.ieee. g-technologies-with/10.1117/12.2312686.full%20, doi: 10.1117/12.
org/document/6127583/, doi: 10.1109/MILCOM.2011.6127583. 2312686.

67704 VOLUME 11, 2023


W. F. Samayoa et al.: Survey on FPGA-Based Heterogeneous Clusters Architectures

[141] A. Mondigo, T. Ueno, K. Sano, and H. Takizawa, ‘‘Comparison [157] M. A. Zapletina and D. A. Zheleznikov, ‘‘The acceleration tech-
of direct and indirect networks for high-performance FPGA clus- niques for the modified pathfinder routing algorithm on an island-
ters,’’ in Applied Reconfigurable Computing. Architectures, Tools, and style FPGA,’’ in Proc. Conf. Russian Young Res. Electr. Electron.
Applications (Lecture Notes in Computer Science: Including Sub- Eng. (ElConRus), Jan. 2022, pp. 920–923. [Online]. Available: https://
series Lecture Notes in Artificial Intelligence and Lecture Notes in ieeexplore.ieee.org/document/9755536/, doi: 10.1109/ElConRus54750.
Bioinformatics), vol. 12083. Springer, 2020, pp. 314–329. [Online]. 2022.9755536.
Available: http://link.springer.com/10.1007/978-3-030-44534-8_24, doi: [158] A. Vaishnav, K. D. Pham, and D. Koch, ‘‘A survey on FPGA
10.1007/978-3-030-44534-8_24. virtualization,’’ in Proc. 28th Int. Conf. Field Program. Log. Appl.
[142] J. D. D. Gazzano, M. L. Crespo, A. Cicuttin, and F. R. Calle, (FPL). Piscataway, NJ, USA: Institute of Electrical and Electronics
Field-Programmable Gate Array (FPGA) Technologies for High Perfor- Engineers, Aug. 2018, pp. 131–138, doi: 10.1109/FPL.2018.
mance Instrumentation. Hershey, PA, USA: IGI Global, Jul. 2016, doi: 00031.
10.4018/978-1-5225-0299-9. [159] K. Fleming, H. Yang, M. Adler, and J. Emer, ‘‘The LEAP FPGA
[143] J. P. Orellana, M. B. Caminero, and C. Carrión, ‘‘Diseño de una arqui- operating system,’’ in Proc. 24th Int. Conf. Field Program.
tectura heterogénea para la gestión eficiente de recursos FPGA en un Log. Appl. (FPL), Sep. 2014, pp. 1–8, doi: 10.1109/FPL.2014.
cloud privado,’’ in Aplicaciones e Innovación de la Ingeniería en Cien- 6927488.
cia y Tecnología. Quito, Ecuador: Abya-Yala, 2019, pp. 165–199, doi: [160] L. Clausing and M. Platzner, ‘‘ReconOS64: A hardware oper-
10.7476/9789978104910.0007. ating system for modern platform FPGAs with 64-bit support,’’
[144] M. Southworth. (Oct. 2021). Choosing the best processor for the job. in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops
Curtis-Wright. [Online]. Available: https://www.curtisswrightds.com/ (IPDPSW), May 2022, pp. 120–127, doi: 10.1109/IPDPSW55747.2022.
sites/default/files/2021-10/Choosing-the-Best-Processor-for-the-Job- 00029.
white-paper.pdf [161] D. Korolija, T. Roscoe, and G. Alonso, ‘‘Do OS abstractions make
[145] M. Qasaimeh, K. Denolf, J. Lo, K. Vissers, J. Zambreno, and P. H. Jones, sense on FPGAs?’’ in Proc. 14th USENIX Symp. Operating Syst.
‘‘Comparing energy efficiency of CPU, GPU and FPGA implementations Design Implement., 2020, pp. 991–1010. [Online]. Available: https://
for vision kernels,’’ in Proc. IEEE Int. Conf. Embedded Softw. Syst. www.usenix.org/conference/osdi20/presentation/roscoe, doi: 10.5555/
(ICESS), Jun. 2019, pp. 1–8, doi: 10.1109/ICESS.2019.8782524. 3488766.3488822.
[146] A. Cicuttin, M. L. Crespo, K. S. Mannatunga, J. G. Samarawickrama, [162] S. Möller et al., ‘‘Community-driven development for computational
N. Abdallah, and P. B. Sabet, ‘‘HyperFPGA: A possible general purpose biology at sprints, hackathons and codefests,’’ BMC Bioinf., vol. 15,
reconfigurable hardware for custom supercomputing,’’ in Proc. Int. Conf. Dec. 2014, Art. no. S7, doi: 10.1186/1471-2105-15-S14-S7.
Adv. Electr., Electron. Syst. Eng. (ICAEES), Nov. 2016, pp. 21–26, doi: [163] M. Pathan et al., ‘‘A novel community driven software for func-
10.1109/ICAEES.2016.7888002. tional enrichment analysis of extracellular vesicles data,’’ J. Extra-
[147] A. Tomori and Y. Osana, ‘‘Kyokko: A vendor-independent high- cellular Vesicles, vol. 6, no. 1, Dec. 2017, Art. no. 1321455, doi:
speed serial communication controller,’’ in Proc. 11th Int. Symp. 10.1080/20013078.2017.1321455.
Highly Efficient Accel. Reconfigurable Technol. New York, NY, USA: [164] M. Kühbach, A. J. London, J. Wang, D. K. Schreiber, F. M. Martin,
Association for Computing Machinery, Jun. 2021, pp. 1–6. [Online]. I. Ghamarian, H. Bilal, and A. V. Ceguerra, ‘‘Community-driven
Available: https://doi-org.ezproxy.cern.ch/10.1145/3468044.3468051, methods for open and reproducible software tools for analyzing datasets
doi: 10.1145/3468044.3468051. from atom probe microscopy,’’ Microsc. Microanal., vol. 28, no. 4,
[148] T. Ueno and K. Sano, ‘‘VCSN: Virtual circuit-switching network for pp. 1038–1053, Aug. 2022. [Online]. Available: https://www.cambridge.
flexible and simple-to-operate communication in HPC FPGA cluster,’’ org/core/product/identifier/S1431927621012241/type/journal_article,
ACM Trans. Reconfigurable Technol. Syst., vol. 16, no. 2, pp. 1–32, doi: 10.1017/S1431927621012241.
Jun. 2023, doi: 10.1145/3579848. [165] R. D. Chamberlain, M. A. Franklin, E. J. Tyson, J. H. Buckley, J. Buh-
[149] T. El-Ghazawi et al., ‘‘Exploration of a research roadmap for ler, G. Galloway, S. Gayen, M. Hall, E. F. B. Shands, and N. Singla,
application development and execution on field-programmable gate ‘‘Auto-pipe: Streaming applications on architecturally diverse systems,’’
array (FPGA)-based systems,’’ George Washington Univ., Washington, Computer, vol. 43, no. 3, pp. 42–49, Mar. 2010, doi: 10.1109/MC.
DC, USA, Tech. Rep. ADA494473, Oct. 2008. [Online]. Available: 2010.62.
https://apps.dtic.mil/sti/citations/ADA494473 [166] Y. Osana, T. Imahigashi, and A. Tomori, ‘‘OpenFC: A portable toolkit
[150] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, for custom FPGA accelerators and clusters,’’ in Proc. 8th Int. Symp.
J. Wawrzynek, and K. Asanovic, ‘‘Chisel: Constructing hardware in Comput. Netw. Workshops (CANDARW), Nov. 2020, pp. 185–190, doi:
a scala embedded language,’’ in Proc. Design Autom. Conf., 2012, 10.1109/CANDARW51189.2020.00045.
pp. 1216–1225, doi: 10.1145/2228360.2228584. [167] G. Chirkov and D. Wentzlaff, ‘‘SMAPPIC: Scalable multi-FPGA archi-
[151] A. Izraelevitz, J. Koenig, P. Li, R. Lin, A. Wang, A. Magyar, D. Kim, tecture prototype platform in the cloud,’’ in Proc. 28th ACM Int. Conf.
C. Schmidt, C. Markley, J. Lawson, and J. Bachrach, ‘‘Reusabil- Architectural Support Program. Lang. Operating Syst. New York, NY,
ity is FIRRTL ground: Hardware construction languages, compiler USA: Association for Computing Machinery, Jan. 2023, pp. 733–746,
frameworks, and transformations,’’ in IEEE/ACM Int. Conf. Comput.- doi: 10.1145/3575693.3575753.
Aided Design Dig. Tech. Papers. Piscataway, NJ, USA: Institute of
Electrical and Electronics Engineers, Nov. 2017, pp. 209–216, doi:
10.1109/ICCAD.2017.8203780.
[152] C. Baaij, ‘‘CλasH: From Haskell to hardware,’’ Fac. EEMCS. Com-
put. Archit. Embedded Syst., Univ. Twente, Enschede, The Netherlands,
Dec. 2009.
[153] M. Kooijman, ‘‘Haskell as a higher order structural hardware descrip-
tion language,’’ Fac. EEMCS, Comput. Archit. Embedded Syst., Univ.
Twente, Enschede, The Netherlands, Dec. 2009. [Online]. Available:
http://essay.utwente.nl/59381/
[154] M. Mariotti, D. Magalotti, D. Spiga, and L. Storchi, ‘‘The bondmachine, a
WERNER FLORIAN SAMAYOA received the B.S. degree in electronics engineering from the University of San Carlos, Guatemala, in 2018. He is currently pursuing the Ph.D. degree in industrial and information engineering with the Multidisciplinary Laboratory (MLab), The Abdus Salam International Center for Theoretical Physics, Università degli Studi di Trieste, under the Joint-Supervision Program. His research interest includes scalable reconfigurable supercomputing.
MARIA LIZ CRESPO is currently a Research Officer with The Abdus Salam International Centre for Theoretical Physics (ICTP) and an Associate Researcher with the Italian National Institute of Nuclear Physics (INFN), Trieste, Italy. She also coordinates the Research and Training Program of the Multidisciplinary Laboratory (MLab), ICTP. She has organized several international schools and workshops on fully programmable systems on chip for nuclear and scientific instrumentation. She is the coauthor of more than 100 scientific publications in prestigious peer-reviewed journals. Her research interests include advanced scientific instrumentation for particle physics experiments and experimental multidisciplinary research.

ANDRES CICUTTIN received the degree in physics from the National University of La Plata, Argentina, in 1992, and the Laurea degree in physics from the University of Trieste, Italy, in 1993. He is currently a Technical Assistant with the Multidisciplinary Laboratory, The Abdus Salam International Centre for Theoretical Physics, and an Associate Researcher with the Italian National Institute for Nuclear Physics (INFN). He has organized and directed numerous international workshops on programmable logic devices for scientific instrumentation and higher education.

SERGIO CARRATO received the master's degree in electronic engineering and the Ph.D. degree in signal processing from the University of Trieste, Trieste, Italy. He was then with Ansaldo Componenti and Sincrotrone Trieste in the field of electronic instrumentation for applied physics. He joined the Department of Electronics, University of Trieste, where he is currently an Associate Professor in electronic devices.

Open Access funding provided by ‘Università degli Studi di Trieste’ within the CRUI CARE Agreement
