A Survey on FPGA-Based Heterogeneous Clusters Architectures
ABSTRACT In recent years, the most powerful supercomputers have reached megawatt power consumption levels, an important issue that challenges sustainability and shows that this trend cannot be maintained. To this date, the prevalent approach to supercomputing is dominated by CPUs and GPUs. Given their fixed architectures with generic instruction sets, they have been favored with a rich ecosystem of tools and mature workflows, which led to mass adoption and further growth. However, reconfigurable hardware such as FPGAs has repeatedly proven that it offers substantial advantages over this supercomputing approach concerning performance and power consumption. In this survey, we review the most relevant works that advanced the field of heterogeneous supercomputing using FPGAs, focusing on their architectural characteristics. Each work is analyzed in three main parts: network, hardware, and software tools. All implementations face challenges that involve all three parts, and these dependencies result in compromises that designers must take into account. The advantages and limitations of each approach are discussed and compared in detail. The classification and study of the architectures illustrate the trade-offs of the solutions and help identify open problems and research lines.
The notion of reconfigurable hardware has been present since 1984, when Altera delivered the first programmable logic device (PLD) to the industry [1]. Then, in 1985, Ross Freeman and Bernard Vonderschmitt patented the first commercially viable field-programmable gate array (FPGA) [2]. Owing to production costs, when compared to application-specific integrated circuits (ASICs), FPGAs are traditionally used in applications with low production volumes that require high throughput and low latency.

FPGAs are electronic devices that consist of many configurable logic blocks composed of look-up tables, flip-flops, I/O blocks, and interconnection fabric. FPGAs are used to create custom hardware solutions, which makes the implementation of algorithms quite different from targeting a CPU. The initial step typically consists of describing the algorithm using a Hardware Description Language (HDL), such as VHDL or Verilog. The HDL description is then synthesized into a netlist that is mapped onto the FPGA's logic elements and interconnections required to implement the desired digital design. The final implementation in the FPGA is performed using vendor-specific tools such as Vivado [3], Vitis [4], Quartus [5], and Libero [6]. Once the mapping and routing process is completed, the design is compiled into a bitstream file loaded onto the FPGA to configure its logic elements and interconnections, creating a circuit corresponding to the algorithm. Although proprietary FPGA vendor tools have dominated the field, there are now open-source FPGA tools, such as Yosys [7], F4PGA [8], and RapidSilicon [9], that provide alternative options for developers seeking open-source solutions.
FPGAs have evolved into more complex devices [10] by integrating components such as embedded memory resources, clock management units, digital signal processing blocks (DSP), network-on-chip (NoC) interconnects, and CPUs [11]. These hybrid devices are known as systems-on-chip (SoC-FPGA) or adaptive SoCs, depending on the vendor. Their increased capabilities have raised interest both in specific applications and in general-purpose computing [12], [13].

As a reconfigurable device, the FPGA offers the advantage of continuous improvement in hardware and software. In fact, being able to change the architecture offers great freedom when developing complex systems. Furthermore, FPGAs have been shown to consume considerably less power than CPUs and GPUs [14], leading to reduced cooling and energy costs.

By studying computing problems, a classification based on repeating algorithmic patterns was proposed in 2004 [15]. In 2006, [16] defined 6 new algorithmic encapsulations, expanding the classification to the 13 dwarfs shown in Table 1. Theoretically, each dwarf can be mapped onto a specific computing architecture [17], [18]. This has inspired the creation of benchmarks for heterogeneous systems such as DwarfBench [19], Rodinia [20], and OpenDwarfs [21].

TABLE 1. The 13 dwarfs of Berkeley [16], where each one represents an algorithmic method encapsulating patterns of communication and/or computation with example problems.

Several implementations of heterogeneous high-performance computing (HPC) systems housing FPGAs can be named, such as Project Catapult at Microsoft [22], Alibaba FaaS (FPGA as a Service) [23], Amazon EC2 F1 instances [24], and the ARUZ cluster at Lodz University [25]. At CERN, the massive adoption of FPGAs for online data processing has motivated the development and adoption of specific tools to aid the development of FPGA-based applications, such as hls4ml [26] (high-level synthesis for machine learning). This tool, along with many others [27], [28], [29], [30], [31], allows for a higher level of abstraction, thereby significantly reducing implementation errors and development time.
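To give a flavor of this level of abstraction, the sketch below shows how a small neural network could be converted into an FPGA firmware project with hls4ml. The model, the FPGA part number, and the output directory are illustrative placeholders chosen by the authors of this example, and the exact options should be checked against the hls4ml documentation.

```python
# Minimal, illustrative hls4ml flow: Keras model -> HLS project for an FPGA.
# The model architecture, FPGA part, and directory names are placeholders.
import hls4ml
from tensorflow import keras

# A tiny classifier standing in for a real trigger or compression model.
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(16,)),
    keras.layers.Dense(5, activation="softmax"),
])

# Derive an HLS configuration (precision, parallelism) from the model.
config = hls4ml.utils.config_from_keras_model(model, granularity="model")

# Convert to an HLS project targeting an example Xilinx device.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="hls4ml_prj",          # generated HLS sources go here
    part="xcu250-figd2104-2L-e",      # example part, adjust to the target board
)

hls_model.compile()                    # builds a C simulation library for validation
# hls_model.build(csim=False)          # would invoke the vendor HLS/synthesis flow
```

The point of such tools is that the developer stays at the model level, while the HDL generation, pipelining, and vendor tool invocation described above are handled automatically.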
The preference for FPGAs is due to their reconfigurability, which allows extreme hardware specialization when needed. In addition, the fact that FPGAs offer a wide array of input-output ports makes them ideal for stream computation and for creating pipelined systems that can maintain high throughput with low latency.

The purpose of this survey is to demonstrate and analyze the challenges of heterogeneous supercomputing by studying the most relevant implementations of FPGA-based cluster architectures from different application fields. Each studied platform provides valuable insight into the decisions and trade-offs developers have made to reach their specific goals. By leveraging their experience, it will be possible to visualize the evolution and present trends in FPGA-based clusters and target the main open challenges. We propose dividing the architectural components of each cluster into network, hardware, and software tools. This division helps identify and discuss the pros and cons of each component in its corresponding domain.

The main contributions of this study are as follows:
1) A comprehensive study of the state of the art of FPGA-based clusters.
2) A three-way segmentation of the clusters' architecture.
3) A critical discussion of the components that build up the studied clusters.

In the context of this paper, we describe a cluster by its computational units (CU), which correspond to its smallest independent parts and sometimes coincide with a single network node. Each CU can be composed of several computational elements (CE), namely CPUs, GPUs, and FPGAs.

The remainder of this paper is organized as follows. Section I elaborates on the implementations and explores relevant advancements in their application fields. A table at the end of each application field discussion summarizes the main contributions of each study, along with the reported performance and energy improvements, when available. Significant differences can be understood by studying the evolution of heterogeneous clusters within each niche. Section II presents the classification of systems from an architectural perspective. The three main aspects described in each implementation were used as comparison points. Subsection II-A presents a comparison of the network infrastructure in the studies. The hardware available in each studied cluster is discussed in Subsection II-B. To complete the classification discussion, the developer tools are compared in Subsection II-C. In Section III we present the remaining open problems and the identified trends. To close this paper, Section IV draws conclusions.

I. CLUSTER IMPLEMENTATIONS
Different FPGA-based cluster implementations were studied, and their specific characteristics highlight the purpose for which they were planned. Technological advances offer greater flexibility, and cost reduction opens the door to increasing complexity. It can be appreciated that there is a clear evolution of the implementations within each application field.
A. MANYCORE EMULATION
The development of manycore platforms is a long and expensive process that involves several stages of experimentation, validation, and integration. There are software tools that help simulate architectures for easy parameter tuning, with the major drawback of speed. In this particular aspect, FPGA prototyping allows faster execution times and benefits from insights obtained from real hardware. It is not rare for a complete platform to exceed the logic available in a single FPGA, pushing for a cluster of FPGAs.

This was the case since 1997, when one of the first FPGA clusters was used to emulate the RAW architecture [32]. The RAW cluster consisted of 5 boards or CUs, each with 64 FPGAs, totaling 320 FPGAs. Its results showed orders of magnitude speed-up compared to contemporaneous scalable processors, with the disadvantages of reduced flexibility, high cost, and high implementation complexity, which hindered their adoption in other research applications.

In 2006, the FAST [33] cluster was presented to bring hardware back into the research cycle and to address the disadvantages of RAW. FAST combined dedicated microprocessor chips and static random access memories (SRAM) with FPGAs into a heterogeneous hybrid solution to simulate chip multiprocessor architectures. The vision was to reduce hardware costs and ease development, both for programming and portability. Each FAST CU consisted of 8 processors, 10 Xilinx Virtex FPGAs, and 4 memory-interconnected tiles. The 2 processors in each tile acted as the CPU and floating-point unit, respectively, and 2 FPGAs acted as the level-one memory controller and coprocessor.

A central hub, made up of 2 FPGAs, was used to manage shared resources and orchestrate communication between tiles, allowing access to off-the-board devices through external I/Os. Additionally, the expansion connector available to the FPGA hub allows multiple FAST CUs to be connected. The CU implementation is illustrated in Figure 1.

FIGURE 1. FAST [33] computational unit (CU) with the computing tiles in orange and the FPGA hub in purple.

A custom software stack was developed specifically for FAST. It included several modules and predefined interfaces for functionality and benchmarking. An operating system was developed to manage control tasks such as programming and configuration. Portability was demonstrated by implementing several architectures; however, scalability and costs remained open to discussion.

Similar to FAST, the RAPTOR cluster was presented as a baseboard hosting up to 4 daughter cards based on complex programmable logic devices (CPLD) [34]. In 2010, a second version was presented using FPGAs and a renewed architecture [35]. This new version consisted of a RAPTOR-Xpress baseboard (CU) that provides two buses for Gigabit Ethernet, universal serial bus (USB) 2.0, and peripheral component interconnect express (PCIe) 2.0 ×8 for the host connection to configure and manage up to 4 DB-V5 (daughter board version 5).

Figure 2 shows the RAPTOR-Xpress baseboard with 4 DBs interfaced directly with their neighbors in a ring topology. Each has a Xilinx Virtex-5 FPGA with up to 4 GB of DDR3 memory and a dedicated FPGA as a PCIe interface. Multiple baseboards can be connected together via 4 high-speed connectors, each consisting of 21 full-duplex serial lanes, enabling resources to scale beyond the 4 DBs on board. The baseboards can also be interfaced with the host via dedicated FPGAs on Nallatech front-side bus acceleration modules [36], which provide an extra 8.5 Gb/s for writing and 5.6 Gb/s for reading.

FIGURE 2. Simplified diagram of the RAPTOR-Xpress board [35] or computational unit (CU) with the daughter boards in orange.

The RAPTOR project also comprises a custom software development environment that includes the RAPTORLIB, RAPTORAPI, and RAPTORGUI tools, which aid developers by providing hardware-supported protocols, remote access, and a graphical user interface to facilitate testing. The design flow includes aids for design partitioning, which is a manual process assisted by a graphical integrated development environment (IDE) and standard synthesis tools developed in vMAGIC [37].

Convinced by the need for cheaper and smaller hardware, the Formic cluster [38], based on the Formic board [39], was presented in 2014. The Formic board acts as the building block for a larger system, with a maximum size of 4096 boards. Each board consists of an FPGA, SRAM, 1 GB of double data rate (DDR) RAM, a power supply, buffered joint test action group (JTAG) connectors, and configuration memory, making it independent and perfectly symmetric. Eight multi-gigabit transceivers (MGT) at a maximum speed of 3 Gb/s are available for interconnection on 8 serial advanced technology attachment (SATA) connectors. Inside each board, a full NoC with a 22-port crossbar switch interfaces the configured blocks with the MGT links and allows developers to scale their designs. Access to local and remote memories is done using the Remote Direct Memory Access (RDMA) protocol [40].
As the first application, a multicore system based on 8 custom MicroBlaze [41] processors per module, forming a 512-core cluster [42], was implemented.

Simultaneously, the industry has produced exciting developments in manycore emulation. In an attempt to reduce the time to market for new ICs, Cadence [43] and Siemens [44], together with others, developed solutions for the prototyping of ICs. Unfortunately, there is little accessible information regarding the architecture of most implementations, and the high costs make them uncommon in academia, with some exceptions, such as the Pico Computing board (now Micron) used for image processing [45] and the DINI (now Synopsys) FPGA board used for online video processing [46].

From the described works, it can be seen that there is a trend in reducing the complexity of CUs, as shown in Figure 3. In this field, costs tend to be the leading factor, making granularity a desirable characteristic. With smaller CUs, it is possible to reduce the implementation costs, depending on the requirements of the chip to emulate. Smaller CUs also make it easier for clusters to scale, maintain, and upgrade.

FIGURE 3. Clusters targeting manycore emulation have shown a trend of reducing the complexity and increasing the granularity of CUs to favor production costs and scalability.

B. SCIENTIFIC COMPUTING
The complexity of scientific computing problems has always pushed technology to its limit, making computer clusters a basic requirement. Regardless of whether complex algorithms process huge amounts of data or massive systems are simulated, reconfigurable computing provides the level of customization required by these problems. This did not go unnoticed: as early as 1991, programmable hardware was already part of custom supercomputers for specific problems, as in RTN [47], RASA in 1994 [48], and later in SUE in 2001 [49].

The first massive cluster was created in 2006. Janus [50] was a massively parallel modular cluster for the simulation of specific theoretical problems in physics, developed by a large collaboration of European institutions [51].

The core of Janus comprised an array of 4 by 4 FPGA-based simulation processors (SP), which were connected with their nearest neighbors. Another processing unit, called an Input/Output processor (IOP), acted as a crossbar and was in charge of managing communications between the FPGAs and the host.

A two-layer software stack was created to help developers build applications. The firmware layer consisted of a fixed part targeting the IOPs, which included a stream router and dedicated devices to communicate with, manage, and program the SPs. The second layer, the Janus Operating System (JOS), consisted of the programs running on the host PCs, including a set of libraries (JOSlib) to manage the IOP devices, a Unix socket application program interface (API) to integrate high-level applications and new SP modules, and an interactive shell (JOSH) for debugging and testing.

In the worst case, Janus performed just 2.4 times faster than conventional PCs. Nonetheless, Janus was limited by its performance and scarce memory for some applications [52].

In parallel, great interest has been shown in the cryptanalysis field with the development of the COPACOBANA FPGA cluster [53] in 2006. Figure 4 shows the COPACOBANA cluster, which was built over a CU holding up to 20 dual in-line memory modules, each with 6 Xilinx Spartan-3 FPGAs directly connected to a 64-bit data bus and a 16-bit control bus. A controller module allowed the host PC to interact via USB or Ethernet through a software library that provided the necessary functions for the PC to program, store, and read the status of the cluster as a whole or as individual FPGAs. This made it possible to scale resources by attaching another CU to the host PC. Its capabilities were demonstrated by testing several encryption algorithms, which resulted in it outperforming conventional computers by orders of magnitude [54].

The positive outcome of this project motivated the creation of a hybrid FPGA-GPU cluster [55] based on commercial off-the-shelf (COTS) components in 2010. The Cuteforce [56] system implemented 15 CUs, 14 with Xilinx Virtex FPGAs
and the last with an NVIDIA GPU, interconnected through a CPU on each CU via InfiniBand. The results were not as expected, partly because of complications in the FPGA implementation.

The same approach was later used in 2010 by Tse et al. [57], who focused on Monte Carlo simulations. However, instead of using one CE per CU, a single CU was used to host 2 CPUs, an NVIDIA GPU, and a Xilinx FPGA, which was further supported by a comprehensive analysis of performance and energy. The network remained practically unchanged from Cuteforce, where the CPUs are the main communication CEs and relegate the GPUs and FPGAs to an accelerator position. To further demonstrate the scalability of this strategy, Superdragon [58] was created to accelerate single-particle cryo-electron 3D microscopy.

Bluehive [59] also sought to distance itself from custom PCBs by embracing commodity boards to build a custom FPGA cluster for scientific simulations and manycore emulation [60] requiring high-bandwidth and low-latency communication. These challenges were overcome with the development of a 64-node FPGA cluster based on Terasic DE4 boards that host an Altera Stratix IV FPGA, an 8x PCIe connector, and a DIMM with 4 GB of RAM, interfaced through a custom interconnect called BlueLink [61] with four 8U rack boxes, each with 16 boards. The boards in the boxes were interconnected through a PCIe-to-eSATA board. A small Linux computer allowed remote programming using a USB-to-JTAG converter and a DE2 board as a JTAG fan-out to parallelize the configuration.

The Bluehive development environment was supported by Quartus, and mandatory blocks were provided to developers: routers for inter-FPGA communication, FBs, and high-speed serial link controllers [61], all developed in Bluespec SystemVerilog [62].

In 2014, Janus received an important upgrade [63], which significantly improved its performance. The architecture remained mostly the same, with the largest change being the adoption of newer FPGAs with 8 GB of RAM and MGTs instead of ordinary I/Os for interconnection.

Janus II and Bluehive were successful in tackling the memory issue, but as problems scale, larger clusters were needed. This was the case for ARUZ [25], an application-specific cluster formed by approximately 26,000 FPGAs distributed over 20 panels, each consisting of 12 rows, which in turn contained 12 CUs. The CUs are composed of eight slave FPGAs that constitute the resources and a central master SoC-FPGA that manages operations. The addition of the Zynq SoC is motivated by the higher abstraction level it provides.
TABLE 3. Scientific computing clusters’ contributions, reported power and performance gains.
The FPGA logic in Catapult is divided into a shell part, containing the common infrastructure and single-event upset logic to reduce system errors, which together consume 23% of the FPGA resources, and a role part where the computing logic lies. Additionally, a mapping manager and a health monitor continually scan each node in the network. In case of failure, the faulty node is immediately reconfigured. If the issue persists, the node is flagged for manual intervention, and the mapping manager automatically relocates the services to the available resources.

With custom hardware and a custom communication protocol, Catapult achieved a 95% improvement in throughput in a production search infrastructure when compared to a software-only solution. In addition, the inclusion of the FPGA increased the power consumption by only 10%, and the added cost of ownership did not exceed the limit of 30%. These results show the significant advantage that FPGAs can offer in terms of throughput and power consumption.

With the success of Catapult [22], it was only a matter of time before FPGAs were made available for cloud computing tasks, which is exactly what the IBM cloudFPGA [73] did. Virtualizing the user space makes FPGAs in an Infrastructure-as-a-Service (IaaS) environment feasible for education, research, and testing.

In the architecture presented, the FPGAs are standalone nodes in the cluster directly interfaced to the DC via PCIe, unlike the approach of Amazon [24], Alibaba [23], and IBM
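A quick back-of-the-envelope reading of these figures helps make the advantage concrete. The ratios below are computed only from the numbers just quoted and are not reported in [22]; they are an editorial illustration.

```python
# Back-of-the-envelope ratios derived from the reported Catapult figures.
# Illustrative arithmetic only; these derived ratios are not from the paper.
throughput_gain = 1.95   # 95% more throughput than the software-only baseline
power_factor = 1.10      # 10% higher power consumption
tco_factor = 1.30        # total cost of ownership increase bounded by 30%

print(f"throughput per watt vs. baseline:      {throughput_gain / power_factor:.2f}x")
print(f"throughput per TCO unit vs. baseline:  {throughput_gain / tco_factor:.2f}x")
# Roughly 1.8x performance per watt and 1.5x performance per cost unit,
# which is what the text above means by a significant throughput/power advantage.
```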
Supervessel [74], which tie the FPGAs to host CPUs. Under this approach, a daughter card consisting of an FPGA and abundant RAM was developed. By creating a custom carrier board, 64 daughter cards can be accommodated in a single 2U rack chassis [75]. To achieve the desired homogeneity within the DC, the FPGAs have been provided with a soft network interface chip, with the advantage of loading only the required services.

The multi-FPGA fabric formed by multiple prototypes of a network-attached FPGA was evaluated with a text-analytics application. The results, compared to a software implementation and an implementation accelerated with PCIe-attached FPGAs, show that the network-attached FPGAs improved in latency and throughput. Additionally, network performance was compared with bare metal servers, virtual machines, and containers [76], with results orders of magnitude better for the FPGA prototype. To further improve the usability of the platform, continuous developments have been made to integrate MPI into the system [77], [78]. An in-depth study of FPGA cloud computing architectures is available in [79] and [80].
D. GENERAL-PURPOSE CLUSTERS
Overspecialized systems tend to constrain the potential of reconfigurable hardware in favor of optimizing performance or costs. Nevertheless, general-purpose clusters are addressed by a larger group of projects seeking to change the programming paradigm. These clusters, rather than being general purpose in the broad sense of the word, serve as experimental platforms to test solutions to all heterogeneous supercomputing challenges, ranging from the network to the user experience.

One of the first projects was the Reconfigurable Computing Cluster (RCC) [81] in the early 2000s. It was a multi-institution investigation project that explored the use of FPGAs to build cost-effective petascale computers, with its main contribution being the introduction of microbenchmarks for software, network performance, memory bandwidth, and power consumption. To evaluate each test, Spirit, a cluster consisting of 64 FPGA nodes, was built. Each node had a Virtex 4 FPGA with 2 Gigabit Ethernet ports, 8 DIMM slots for onboard RAM, and 8 MGTs for the board-to-board interconnection [82] using the Aurora protocol [83].

For internode communication, a configurable network layer core was developed as part of the Adaptable Computing Cluster project [84]. It consists of a network switch implemented in the FPGA acting as a concentrator for the router. Considering that the head node is a workstation, a message-passing interface (MPI) approach offered the flexibility that the cross-development environment required. A custom compiler based on GNU GCC was built to support OpenMPI and its Modular Component Architecture (MCA) [85], which was adapted to support the high-speed network. A software infrastructure based on a Linux system allowed users to access, manage, and configure all nodes of the cluster via SSH [86].
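For readers unfamiliar with the message-passing style exposed by such clusters, the sketch below illustrates the idea with mpi4py. It is a hypothetical host-side example, not code from the RCC project: each MPI rank drives one node's accelerator and exchanges partial results over the network, while the FPGA-specific offload call is left as a placeholder.

```python
# Conceptual MPI host program for a cluster where each rank manages one
# FPGA-equipped node. Illustrative only; offload_to_fpga() is a placeholder
# for whatever vendor- or platform-specific API the node actually provides.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def offload_to_fpga(chunk: np.ndarray) -> np.ndarray:
    # Placeholder: a real system would stream `chunk` to the local accelerator
    # (e.g., over PCIe or an MGT link) and read back the processed result.
    return chunk * 2.0

# Rank 0 scatters the workload, every rank processes its share locally,
# and the partial results are reduced back on rank 0.
n_per_rank = 1024
data = np.arange(size * n_per_rank, dtype=np.float64) if rank == 0 else None
chunk = np.empty(n_per_rank, dtype=np.float64)
comm.Scatter(data, chunk, root=0)

local_sum = np.array(offload_to_fpga(chunk).sum())
total = np.zeros_like(local_sum)
comm.Reduce(local_sum, total, op=MPI.SUM, root=0)

if rank == 0:
    print("global result:", float(total))
```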
Similar to the RCC project, the FPGA High-Performance Computing Alliance (FHPCA) [87] was established in 2005 with the Maxwell supercomputer [88]. The Maxwell CUs were built on a standard IBM BladeCenter chassis, in which an Intel Xeon and 2 FPGAs were interfaced via PCI-X. Additionally, an FPGA-dedicated network is available via MGTs without routing logic, given the nearest-neighbor scheme. By supporting standard parallel computing software, structures, and interfaces, it sought to disrupt the HPC space without causing significant friction.

To facilitate the development of applications targeting Maxwell, the Parallel Toolkit (PTK) [89] was developed. It included a set of practices and infrastructure to solve issues such as associating tasks with FPGA resources, segmenting the application into bitstreams, and managing code dependency. PTK provided a set of libraries where common standard interfaces, data structures, and components were defined.

Similarly, Cube was created to explore the scalability of a cost-effective massive FPGA experimentation cluster for real-world applications. It consisted of 8 boards, each hosting a matrix of 8 by 8 Xilinx FPGAs [90], forming a cluster of 512 FPGAs, as shown in Figure 5. It features a single-configuration, multiple-data programming paradigm that allowed all FPGAs to be configured with the same bitstream in a matter of seconds. The FPGAs were interconnected in a systolic array that reached up to 3.2 Tb/s of inter-FPGA bandwidth, offering significant advantages as it simplified the programming model and greatly relaxed the requirements of the PCB layout.

Simultaneously, Quadro Plex (QP) [91], a hybrid cluster, was introduced. It was composed of 16 nodes, each consisting of one AMD CPU, 8 GB of RAM, 4 NVIDIA Quadro GPUs, and one Xilinx Virtex 4 Nallatech FPGA accelerator. The nodes were interconnected using Ethernet and InfiniBand. Cluster communication was managed using the OpenFabrics Enterprise Distribution software stack. The complete system occupied four 42U racks, consumed 18 kW, and had a theoretical performance of 23 TFLOPS. CUDA was used for GPU development, and the FPGA workflow completely relied on the Xilinx ISE design suite [92].

Several applications were developed, showing that there were substantial difficulties in taking advantage of the entire system. Applications would only use a combination of CPUs and GPUs or CPUs and FPGAs. A framework for easing the porting of applications and providing a compatibility layer for different accelerator workflows, called Phoenix [93], was developed.

In the same spirit, Axel [94] was built, consisting of 16 nodes. Each node had an AMD CPU, an NVIDIA Tesla GPU, and a Xilinx Virtex 5 FPGA, occupying a 4U full-scale rack. All CEs were connected to a common PCIe bus for intranodal communication and between nodes in a Gigabit Ethernet network. Considering the high latency and nondeterministic nature of Ethernet, a parallel network using the 4 MGTs of the FPGA was also available.
TABLE 4. Data center FPGA clusters' contributions, reported power and performance improvements.
FIGURE 5. Cube [90] computational unit (CU) showing the configuration controllers in purple. Dotted lines
show the control and configuration bus and solid lines show the data path.
The cluster was managed remotely from the central node using the Torque [95] resource manager and the Maui [96] scheduler. A custom resource manager (RM) was responsible for managing the GPUs and FPGAs. For this to be feasible, all Axel programs needed to allocate part of the resources in the CEs to interface with the RM runtime API.
An important aspect that most clusters left out, besides those focused on communication, was the interface with the physical world. This is the empty space that the Axiom platform [100] seeks to fill with a custom scalable cluster based on a board with a Xilinx MPSoC (Multiprocessor System-on-Chip) supporting the Arduino interface.

The MPSoC has an FPGA fabric, four 64-bit ARM cores for general-purpose applications, and two 32-bit ARM cores for real-time applications on the same die. Four USB-C ports managed by the FPGA MGTs are available for interconnecting the boards. A custom network interface (NI) in the FPGA provides support for all communications, allowing users to focus on their applications written in an OpenMP extension called OmpSs. The NI is divided into six main groups: a data mover that deals with DMA transfers, RX and TX controllers, and FIFOs to cache packets. A router is interfaced with each NI and is responsible for handling the USB-C channels, monitoring the network, and establishing virtual circuits.

As part of the Axiom project, a custom software stack [101] consisting of multiple layers was also developed. Its foundation is a distributed shared-memory (DSM) architecture. The main advantage of this approach is that it allows applications to directly address physical memory by transparently relying on an OS network. Several tests [102] and benchmarks have validated the effectiveness of the platform, pushing the project forward into IoT and edge computing [103].

Progress in this field has led to the creation of the Xilinx Adaptive Compute Clusters (XACC) [104] group under the Xilinx Heterogeneous Accelerated Compute Clusters (HACC) [105] initiative. This industry and academic collaboration focuses on the development of new architectures, tools, and applications for next-generation computers.

As part of this initiative, several clusters were built at some of the world's most prestigious universities in Switzerland, the USA, Germany, and Singapore. At the Paderborn University National High-Performance Computing Center (PC2), the high-performance clusters Noctua [106] and Noctua 2 [107] were built to provide hardware to accelerate research on computing systems with high energy efficiency.

The Noctua 2 cluster was designed to fit common server racks and be compatible with network industry standards. It has 36 nodes with 2 AMD Milan CPUs each. A combination of 48 Xilinx Alveo and 32 Intel Stratix 10 GX FPGAs comprises the reconfigurable computing part of the cluster. Each Stratix node has 4 pluggable QSFP+ links at 40 Gb/s, each Alveo has 2 QSFP+ links at 100 Gb/s, and development depends on Intel tools such as oneAPI [108], OpenCL, and DSP Builder. A dedicated optical switch is used to build a configurable point-to-point network between all FPGAs.

More recently, Enzian [109] was developed as a scalable platform to fill the void left by industry-specific hybrid platforms. The reasoning behind Enzian is to provide a general, open, and affordable platform for research on hybrid CPU-FPGA computing, escaping the niche of specific-purpose hybrid platforms by providing a lot of flexibility. Explicit access to coherence messages, thermal and power monitoring, and an open baseboard management controller (BMC) allows for research that is not possible in any current commercial system.

Likewise, UNILOGIC [110] presented a new approach, this time from the management side of the cluster, by introducing a Partitioned Global Address Space (PGAS) parallel model for heterogeneous computing. This allows hardware accelerators to directly access any memory location in the system, and locality makes coherency techniques unnecessary, greatly simplifying communication. By integrating Dynamic Partial Reconfiguration (DPR) into the framework, accelerators can be installed on the go. The UNILOGIC architecture was evaluated on a custom prototype consisting of 8 interconnected daughter boards, each with four Xilinx Zynq UltraScale+ MPSoCs and 64 gigabytes of DDR4 memory, yielding better energy and computing efficiency than conventional GPU or CPU parallel platforms.
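To make the PGAS idea concrete, the following toy sketch partitions one logical address space across nodes so that a get or put on a global index is either a local access or a remote transfer, which is the behavior the UNILOGIC accelerators obtain in hardware. This is not UNILOGIC code; the class, the two-node setup, and the addressing scheme are invented purely for illustration.

```python
# Toy model of a Partitioned Global Address Space (PGAS): one global array is
# split into per-node partitions; accesses to remote indices become transfers.
# Purely illustrative -- real PGAS runtimes (and UNILOGIC) do this in hardware
# or via RDMA, without this Python indirection.

class PGASNode:
    def __init__(self, node_id: int, partition_size: int):
        self.node_id = node_id
        self.partition_size = partition_size
        self.local_mem = [0] * partition_size   # this node's slice of the space
        self.peers = {}                          # node_id -> PGASNode

    def owner(self, global_addr: int) -> int:
        return global_addr // self.partition_size

    def get(self, global_addr: int) -> int:
        owner = self.owner(global_addr)
        offset = global_addr % self.partition_size
        if owner == self.node_id:
            return self.local_mem[offset]            # local access, no communication
        return self.peers[owner].local_mem[offset]   # stands in for a remote read

    def put(self, global_addr: int, value: int) -> None:
        owner = self.owner(global_addr)
        offset = global_addr % self.partition_size
        target = self if owner == self.node_id else self.peers[owner]
        target.local_mem[offset] = value             # remote writes go to the owner

# Two nodes sharing a 16-word global address space (8 words each).
n0, n1 = PGASNode(0, 8), PGASNode(1, 8)
n0.peers[1], n1.peers[0] = n1, n0
n0.put(12, 42)          # address 12 is owned by node 1
print(n1.get(12))       # node 1 sees the value locally -> 42
```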
In 2022, the supercomputer Cygnus [111] was updated [112] to follow a multi-hybrid accelerator approach based on GPUs and FPGAs. 32 Albireo nodes were added to Cygnus, each consisting of 4 NVIDIA V100 GPUs and two Intel Stratix 10 FPGAs. Similar to previous systems, a dedicated FPGA network was created with a 2D torus topology and improved stream capabilities, called CIRCUS [113]. Collaboration between the FPGAs and GPUs is achieved by using a DMA engine in the FPGA that accesses the GPU directly, bypassing the CPU and offering almost double the throughput.

Finally, Fugaku [114], the first supercomputer to win all four categories in the Top500, presented a prototype FPGA cluster, ESSPER [115]. Motivated by the impressive continuous improvements of FPGAs regarding energy and performance, a cluster of 8 nodes, each with two Intel Stratix 10 FPGAs, was built and tested. This cluster was interfaced with Fugaku using a novel approach called loosely coupled, where a host-FPGA bridging network provides interoperability and flexibility to all nodes in Fugaku.

E. COMMUNICATION SYSTEMS INFRASTRUCTURE
Another field of application where clusters of FPGAs are relevant is the emulation of communication system infrastructure. The most important difference with respect to manycore emulation is the need to interface with analog systems. This requirement implies providing additional external ports to interface with radio front-ends.

One of the first implementations was the Berkeley Emulation Engine (BEE) [117] in 2003. Its main purpose was to support design space exploration for real-time algorithms, focusing mainly on data-flow architectures for digital signal processing.

BEE was designed to emulate the digital part of telecommunication systems and to provide a flexible interface for radio front-ends. Computations are performed inside BEE Processing Units (BPU). Each BPU has a main processing
board (MPB) and 8 riser I/O cards for 2400 external signals. The MPBs are the main computing boards, hosting 20 Xilinx Virtex FPGAs, 16 zero-bus turnaround (ZBT) SRAMs, and 8 high-speed connectors. FPGAs on the periphery of the board have off-board connectors to link other MPBs. A hybrid network consisting of a combination of a mesh network and a partial crossbar, called a hybrid-complete graph and partial crossbar (HCGP) [118], was implemented. A single-board computer (SBC) running Apache web services over Linux allows users to deploy their applications and perform configuration and slow-control tasks.

To take full advantage of the platform, an automated high-level workflow was used [119] that relied on MATLAB and Simulink to develop the main hardware blocks. The BEE compiler then processes the output and generates the required VHDL files for the simulation and configuration of the system. A time-division multiple access (TDMA) receiver was fully implemented to satisfy real-time requirements and validate the workflow.

Following the BEE success, the BEE2 [120] was conceived as a universal, standard reconfigurable computing system consisting of 5 Virtex 2 FPGAs, each with 4 DIMM connectors for up to 4 GB of RAM. Four FPGAs are available for computing, and one is reserved for control tasks. Pivoting away from the HCGP, an onboard mesh was implemented between the 4 computing FPGAs. Using high-speed links, it was possible to aggregate the 5 FPGAs and use them as a single, larger FPGA. The workflow remained almost the same as for BEE, with the main change being the use of a synchronous data flow computational model for both the microprocessor and the FPGA.

To overcome the shortcomings of BEE2 and take advantage of the already validated Spirit architecture [121], a digital wireless channel emulator (DWCE) [122], [123] was developed. It consisted of 64 nodes in the same way as Spirit, but with valuable upgrades to demonstrate the capabilities of FPGA clusters with military radios. Its capabilities improved with an upgraded FPGA, 2 additional FMC connectors, and the adoption of a standard MicroTCA.4 form factor.

Considering the possible improvements to BEE, and because it was being developed as part of the research accelerator for multiple processors (RAMP) community [124], a fast response was presented in the form of BEE3 [116]. The development of BEE3 differed from previous iterations and successfully demonstrated a new collaboration methodology between industry and academia [125].

The architecture of BEE3 changed substantially from that of its predecessor by removing the control FPGA and introducing a control module on a smaller PCB. Another aspect worth highlighting is that, for the first time, a PCB was intentionally developed to support different FPGA parts, all interconnected using a DDR2 interface in a ring topology. The BEE3 prototype had approximately 30 collaborators, most of whom were professionals with extensive knowledge of CAD. Relying on industry specialists for PCB design resulted in simpler and more reliable PCBs within a shorter project time horizon. In addition, it was possible to parallelize the design process, allowing the academic community to focus on firmware development.

The BEE collaboration presented its final iteration in 2010, consisting of BEE4 and miniBEE [126], [127]. BEE4 was updated to support Virtex 6 FPGAs and up to 128 GB of DDR3 RAM per module. The QSH connectors were removed in favor of FMC connectors to support a wider range of mezzanine boards. BEE4 was built around the Honeycomb architecture using the Sting I/O intermodule communication protocol. The design tools were further refined to include Nectar OS and the BeeCube Platform Studio in MATLAB/Simulink, which are unfortunately proprietary. However, being a proprietary system did not discourage its use in academia [128]. The success of BEECube attracted further interest from the industry, and it was bought by National Instruments in 2015 [129]. Today, it is part of the FlexRIO [130] line-up, and software development is supported by NI tools. From this point onward, almost all implementations depend on commercially available emulation platforms.

To demonstrate the scalability of such implementations, the world's largest wireless network emulator, Colosseum [131], was built; it can compute workloads of 820 Gb/s and perform 210 tera-operations per second. It was formed by CUs that consisted of three FPGAs in a chain. The outer FPGAs were used to interface with the radios and provide some processing, while the central FPGA is dedicated to digital signal processing. Commercially available solutions were selected to avoid complications when designing the custom board. For the radio-attached FPGAs, 128 USRP-X312 [132] software-defined radios were used. Each provides the analog interfaces required for the antennas, along with a Kintex 7 FPGA. As dedicated processing FPGAs, 16 NI ATCA-3671 [133] modules were used, each hosting 4 FPGAs. The 64 processing FPGAs were interconnected in a 4 × 4 × 4 HyperX topology [134], which allowed the data to be efficiently distributed for processing.

The NI modules are based on the BEE architecture and support the same development tools. Given the complexity of the system, a Python data-flow emulator [135] was built to confirm the topology and architecture of the system. It is possible to confirm the latency of the system by providing models of the implemented components and topology.

Another notable contribution of this study is the proposal of a data-flow methodology [136]. It comprises three guiding principles that highlight the issues present in other implementations. The first principle is the use of a unified interface for modular components to favor portability. Second, when dealing with heterogeneous systems, the suggested approach is asynchronous processing to decouple operations from time and favor parallelization. Finally, based on design best practices, solutions are urged to be vendor-independent.
TABLE 6. Communication systems emulation clusters’ contributions, reported power and performance gains.
II. CLASSIFICATION
After studying each of the works described above, it was possible to identify common elements. These elements reflect the decisions made by the designers when conceiving each cluster. Given that heterogeneous computing is broad and complex, no universal methodology to design a cluster has been developed so far. A classification system based on the uniformity of the system and its nodes was proposed in [94].

We propose segmenting the cluster infrastructure into three main components, as shown in Figure 7. The first aspect is the network. This covers the physical interfaces chosen to connect the nodes, the logic protocols, and the topologies. Another important aspect to consider is the hardware available in the CU. Each CU can have more than one CE type. Finally, the software tools that allow the cluster to be securely available to users for development were considered.

FIGURE 7. Main concepts for the proposed classification of clusters.

A. NETWORK
A cluster is no more than a set of computational elements (CE) that collaborate toward a common goal. The collaboration method and its means are crucial for ensuring implementation efficiency. The means of collaboration branch out from the hardware interfaces to the communication protocols and, ultimately, the schedulers or other methods of synchronization. With this consideration, we can draw a line between systems that delegate communication tasks to an external entity and those that incorporate the stack. Another important aspect of the interconnection is how it is handled. In high-speed stream computing, it is desirable that the communication be established as direct data channels with back-pressure, and this particular aspect is difficult to replicate with purely routed networks.

Table 7 shows several aspects that distinguish the implementations concerning network infrastructure for all the works presented in Section I. The manner in which nodes are connected is discriminated according to the existence of any additional hardware that processes, redirects, or interprets a stream of data or packets between adjacent nodes, which defines an indirect interconnection. This implies that a direct interconnection is such that one node can interact directly with the nearest neighbor without the need for additional networking hardware, excluding physical interfaces.
In these implementations, network services are provided by in-fabric routers and switches, which allow users to experiment with different protocols at the expense of resources. This is particularly crucial in implementations targeting heavy communication problems that require low latency. By no means does the dependence on external hardware impose a disadvantage, because recent implementations show that it is capable of extending scaling capabilities without affecting performance, as in the case of [25], which effectively interconnects thousands of CEs. As shown in [141], adding dedicated network hardware increases the latency by a constant factor. To determine the impact on performance, a ring topology was implemented using the E40G protocol. The experiments showed that the performance depended on the size of the packet. For smaller packets, the latency was dominant, tipping the scale in favor of direct interconnection, but for larger packets (> 1 MB), the switched implementation offered an improvement of approximately 5%.

To compare the impact of both network connections, a leaf-spine topology was implemented for the switched network, and a ring topology for direct interconnection. The switched network was modeled for 2048 FPGAs using 64-radix switches. The ring topology was simulated for the direct network. These simulations showed that for a small message size (roughly 1 MB or less), a direct network offers a shorter transmission time than a switched network, regardless of the number of nodes. In contrast, larger payloads (> 227 MB) benefit from a switched network, but only up to 1024 nodes.
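The trade-off can be illustrated with a first-order transmission-time model. All latency and bandwidth numbers below are arbitrary placeholders chosen only to show how the crossover between direct and switched networks emerges; they do not reproduce the measurements reported in [141].

```python
# First-order model of a single message transfer between neighboring nodes.
# A direct link has minimal latency but a single lane; a switched path adds
# per-switch latency but can stripe the payload over multiple lanes.
# All constants are illustrative placeholders, not measured values.

def direct_time_us(msg_bytes, link_gbps=40.0, link_latency_us=0.5):
    # One hop over a point-to-point link: fixed latency + serialization time.
    return link_latency_us + msg_bytes * 8 / (link_gbps * 1e3)

def switched_time_us(msg_bytes, lane_gbps=40.0, lanes=2,
                     switch_latency_us=1.5, switch_hops=3):
    # Several switch traversals, but the payload is spread over multiple lanes.
    return switch_hops * switch_latency_us + msg_bytes * 8 / (lanes * lane_gbps * 1e3)

for size in (1_000, 1_000_000, 256_000_000):      # 1 kB, 1 MB, 256 MB
    d, s = direct_time_us(size), switched_time_us(size)
    winner = "direct" if d < s else "switched"
    print(f"{size:>11} B: direct {d:12.1f} us, switched {s:12.1f} us -> {winner}")
```

With these placeholder parameters, small messages are dominated by the fixed switch latency (direct wins), while large messages are dominated by serialization time, where the extra lanes of the switched fabric pay off, mirroring the qualitative behavior described above.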
B. HARDWARE
Table 8 shows the most relevant studies classified according to the characteristics of their CU. The Axel classification system was used to identify the uniformity of the node (CU) and the system. The total number of CUs is also presented. Some studies have presented the architecture of a single node as a building block for a future cluster. These are considered relevant for their contribution to the study of heterogeneous workloads. The form factor of each work is shown in the respective column. Given that these are heterogeneous systems, different types of CE may be present within CUs, and sometimes even among CUs. Finally, Table 8 shows the total number of CEs implemented in the system.

As previously described, the classification based on node (CU) and system uniformity is useful for understanding the programming paradigm. According to the previous definition of nodes, multiple CEs can be hosted on a single node. The balance between a crowded or a simple node resides in the diversity of the CEs and the network infrastructure. Diverse CEs in a single node allow for the highest resource availability per CU for developers, but it remains a challenge to interface all devices considering all the different ports. This leads to different form factors that directly impact the way the cluster scales and, more importantly, the availability of physical structures to hold the nodes in place and provide efficient cooling. Custom CU form factors usually host several CEs and can compromise not only scalability but also fault tolerance. This is the case for ARUZ [25], where a single node hosts up to 11 FPGAs. In the unfortunate case where one or more CEs break down, the OS must be notified to circumvent these nodes or completely ignore them until they are fixed. In this regard, COTS clusters have a great advantage: the bring-up cost is mostly absorbed by the industry, which provides tested and validated nodes for quick installation, as is the case for Noctua 2 [107] and Catapult [22], among others. Some rare cases of industry and academia collaboration greatly benefit from COTS advantages with specific research-motivated modifications, as in Novo-G [97] and BEE3 [125].
C. SOFTWARE TOOLS
Finally, each work discussed would be incomplete if there were no tools available to help users develop their applications. These tools provide different layers of isolation, ranging from templates that encapsulate internode communication to complete operating systems that manage multiple-user access. Each of the tools offers a degree of abstraction, encapsulating all underlying details to offer services to the user or to a higher layer. The depth of the layer stack depends on several factors:
• Purpose of the cluster
• Degree of freedom intended for the user (isolation)
• Cluster flexibility

A stack of tools can be structured according to the services provided and required, as shown in Figure 9. First, we have the interface with the external world at the physical level. Naturally, we rely on electric signals controlled by internal gates, GPIOs, or MGTs. Typically, in HPC, this is not up for discussion, given that CPUs and GPUs have fixed interfaces, but FPGAs are not bounded by this. Thus, the communication layer can be either left available for users to freely customize and test or fixed by the developers and provided as a service. On top of that, we have the actual CEs; these can be CPUs, GPUs, or FPGA custom cores. This is the part where the actual computing is performed, and users may be able to define the entities, or developers may provide programmable blocks. To interact safely with these blocks, drivers, a file system, and a scheduler may be provided as an operating system. This creates a safe space for users to build applications based on the hardware and communication services. At this level, users must rely on a programming language that describes how the underlying parts cooperate for the intended computation. Some studies have presented new programming languages that aim to capture the different programming paradigms in heterogeneous clusters. Tools that take the abstract description of the computation and transform it into instructions may be provided as a contained solution or along with libraries and APIs to facilitate development. Another level of abstraction may be introduced, in which users interact with prebuilt blocks in a fixed context inside a GUI.

Table 9 shows a series of works with the development tools provided and the intended application. Depending on the scope of the application, user needs vary and may require deeper access to the system or more abstract tools. Most general-purpose clusters are intended to be used as research platforms. This requirement relaxes many management applications and abstraction layers that, in turn, must be provided to the user in other cases. As research could take place at the lowest level of communication, users may need the freedom to change the electrical standard of the GPIOs or the encoding of the MGTs. These properties are only available if the user sees the platform as a bare-metal solution, or if the development environment has a standardized way of defining communication devices. In any case, most systems avoid this by providing the user with a template that abstracts the communication layer. Specific application clusters seek to encapsulate most of the details such that the user faces only the challenges related to the application.

The flexibility of the software stack also depends on the platform's openness. In this regard, FPGA development frameworks have been significantly delayed compared to those for CPUs and GPUs. Currently, one can use complete open-source frameworks to develop applications for CPUs and GPUs, but FPGAs are radically different. One reason for this is that the stack of tools is fundamentally different. Instead of targeting fixed hardware through a well-known and well-defined instruction set architecture (ISA), FPGA tools target configuration memory with architecture-specific information. These architectural details tend to be industry secrets that force developers to rely on vendor tools with all their benefits and limitations. One of the most important limitations is the proprietary nature of some vendor tools. Efforts have been made to create completely open-source workflows, such as F4PGA [8], in which experienced users can actively collaborate to improve the platform.
TABLE 8. Hardware architecture of computation units (CU) with respect to their computational elements (CE).
Finally, an important aspect directly tied to the application is the level of flexibility provided by the cluster. Some applications can implement external hardware for optional data streams. This is the case for all the BEE implementations. Other domains of flexibility include the network topology and the communication protocol of the cluster. The portability of the framework is described by the flexibility of the CEs. This means that the cluster CEs could potentially be updated or changed without requiring important modifications of the development tools, future-proofing them and providing customization depending on the user's needs.

III. OPEN PROBLEMS AND TRENDS
Supercomputing is a complex and fast-evolving field in which CPUs and GPUs have traditionally dominated the market. Several successful attempts have been made to introduce FPGAs in this context, such as the F1 instances at Amazon and the IBM cloudFPGA service. The flexibility and energy efficiency of FPGAs strongly challenge CPUs and GPUs for the same computing tasks, further motivating research in this area.

The opportunities that FPGAs offer to heterogeneous computing are huge. As already shown in several studies [14], [142], [143], [144], [145], FPGAs can surpass CPU energy efficiency by orders of magnitude.
• Runtime performance analysis tools to identify bottlenecks.

In 2008, the Strategic Infrastructure for Reconfigurable Computing Applications (SIRCA) [149] provided a comprehensive study of the tools required for the adoption of mainstream reconfigurable computing. This study separated the tools based on four relevant phases: formulation, design, translation, and execution.

The initial phase, in which the algorithms are elaborated and optimized for parallel computing, is referred to as formulation. This is the highest level of abstraction, mostly dealing with pseudo-code and verbal language for reasoning. SIRCA highlighted the need for tools that aid developers in making strategic decisions that favor the parallel model embedded in heterogeneous computing, rather than leaving the decisions to the later phases. Formulation is the most critical step, in which researchers can benefit the most from insight into the paradigm present in the targeted heterogeneous system. Tools that provide strategic exploration, high-level prediction, and numerical analysis have a strongly positive impact on the other phases.

The design phase consists of the languages used to translate an algorithm into a behavioral implementation. This field has been broadened by the creation and adaptation of modern HDL languages, such as Chisel [150], [151], based on Scala, and Clash [152], [153], based on Haskell, and of high-level synthesis tools, such as BondMachine, based on Go [154], among several others. New developments have solved, to some degree, the issues of portability and interoperability by raising abstraction. However, the method for scaling designs to heterogeneous clusters remains platform-specific. Without these facilities, users are expected to be responsible for porting and partitioning the design. Furthermore, users are tasked with specifying the concurrency model at the system level, which is a difficult task. An in-depth study of design tools, frameworks, and strategies for design space exploration can be found in [155].

Once a PC-compatible description of the algorithm is available, the next phase maps it to the actual physical resources of the system. This phase is known as translation or place-and-route (PAR). Several improvements have been made in recent years [156], [157]. Most focus on speeding up the process by implementing parallelism, with good results when compared with vendor tools. However, these improvements are not easily integrated into proprietary workflows and require a high level of expertise for effective usage. Likewise, existing PARs targeting clusters are platform-specific, and this will not change until a standard way of describing a heterogeneous system is adopted.

In the final phase, execution, developers must be able to verify and analyze the performance of the implementation. Critical runtime services must be included, such as task management, checkpoints, heartbeats, and debugging services. The effective implementation of such services depends on their consideration in the previous design phases. The works studied in detail in [158] provided definitions of abstraction layers for user interaction and management, showing great improvement in the execution phase.
The design phase consists of the languages used to translate
an algorithm into a behavioral implementation. This field has
been broadened by the creation and adaptation of modern IV. CONCLUSION
HDL languages, such as Chisel [150], [151] based on Scala Supercomputers have been growing in recent years to occupy
and Clash [152], [153] based on Haskell, and high-level syn- large areas and consume as much energy as small towns. This
thesis tools, such as BondManchine based on Go [154] among trend is impossible to support, and highlights the major issues
several others. New developments have solved, to some of the current approach based on CPUs and GPUs. Mean-
degree, the issues of portability and interoperability by raising while, FPGA-based heterogeneous platforms have shown
abstraction. However, the method for scaling designs to het- great improvements in performance and energy consumption
erogeneous clusters remains platform-specific. Without these when compared to their CPU or GPU counterparts. Nonethe-
facilities, users are expected to be responsible for porting less, adoption has remained low, primarily owing to the
and partitioning the design. Furthermore, users are tasked complexity of hardware design and the lack of standards for
with specifying the concurrency model at the system level, interconnection, structure, and program description, to men-
which is a difficult task. An in-depth study of design tools, tion some that affect most development tools by forcing
frameworks, and strategies for design space exploration can over-specification.
be found in [155]. By studying the most relevant implementations of FPGA
Once a PC-compatible description of the algorithm is available, the next phase maps it to the actual physical resources of the system. This phase is known as translation or place-and-route (PAR). Several improvements have been made in recent years [156], [157]. Most focus on speeding up the process by implementing parallelism, with good results when compared with vendor tools. However, these improvements are not easily integrated into proprietary workflows and require a high level of expertise for effective usage. Likewise, existing PARs targeting clusters are platform-specific, and this is unlikely to change until a standard way of describing a heterogeneous system is adopted.
In the final phase, execution, developers must be able to verify and analyze the performance of the implementation. Critical runtime services must be included, such as task management, checkpoints, heartbeats, and debugging services. The effective implementation of such services depends on their consideration in the previous design phases. The works studied in detail in [158] provided definitions of abstraction levels for FPGA virtualization.
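As a sketch of one such runtime service, the host-side heartbeat monitor below periodically probes a node and invokes a recovery action (for instance, restoring the last checkpoint or reassigning tasks) when the node stops responding. The node name, probe, and timing values are hypothetical; in a real cluster the probe would query the board over PCIe, Ethernet, or a management interface.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicLong

// Hypothetical host-side heartbeat monitor for a single FPGA node.
final class HeartbeatMonitor(nodeId: String, periodMs: Long, timeoutMs: Long,
                             probe: () => Boolean) {
  private val lastSeen  = new AtomicLong(System.currentTimeMillis())
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(onFailure: String => Unit): Unit =
    scheduler.scheduleAtFixedRate(() => {
      if (probe()) lastSeen.set(System.currentTimeMillis())
      else if (System.currentTimeMillis() - lastSeen.get() > timeoutMs)
        onFailure(nodeId) // e.g., restore the last checkpoint or reassign tasks
    }, periodMs, periodMs, TimeUnit.MILLISECONDS)

  def stop(): Unit = scheduler.shutdownNow()
}

object HeartbeatDemo extends App {
  // The probe here randomly fails to emulate a flaky link; a real one would poll the board.
  val monitor = new HeartbeatMonitor("fpga-node-0", periodMs = 500, timeoutMs = 2000,
                                     probe = () => scala.util.Random.nextDouble() > 0.1)
  monitor.start(id => println(s"Node $id missed heartbeats; restoring last checkpoint"))
  Thread.sleep(5000)
  monitor.stop()
}
```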
• Universal debugging and verification tools for distributed reconfigurable computing
Even if there are platform-specific solutions to some of the previously mentioned challenges, the real challenge is to develop standard and generic solutions suitable for any heterogeneous cluster implementation in a community-driven development approach that would greatly accelerate adoption and growth, as shown in [162], [163], and [164]. At the high-level synthesis stage, novel frameworks provide a convenient approach by including off-chip synchronization and communication APIs, such as Auto-Pipe [165] in 2010 and, more recently, OpenFC [166] and SMAPPIC [167].
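The sketch below illustrates what such a platform-neutral communication API can look like from the application side: the program addresses ranks and messages, while the concrete transport (Ethernet, PCIe, or direct serial links between FPGAs) is supplied per platform behind a single interface. All trait, class, and method names here are hypothetical and are not taken from Auto-Pipe, OpenFC, or SMAPPIC; an in-process loopback stands in for the network so that the example is executable.

```scala
import scala.collection.mutable

// Hypothetical platform-neutral, MPI-like transport interface.
trait ClusterTransport {
  def send(destRank: Int, payload: Array[Byte]): Unit
  def recv(srcRank: Int): Array[Byte]
}

// In-process stand-in used only to make the example runnable.
final class LoopbackTransport extends ClusterTransport {
  private val mailboxes = mutable.Map.empty[Int, mutable.Queue[Array[Byte]]]
  def send(destRank: Int, payload: Array[Byte]): Unit =
    mailboxes.getOrElseUpdate(destRank, mutable.Queue.empty).enqueue(payload)
  def recv(srcRank: Int): Array[Byte] =
    mailboxes.getOrElseUpdate(srcRank, mutable.Queue.empty).dequeue()
}

object ScatterGatherSketch extends App {
  val net: ClusterTransport = new LoopbackTransport
  // Scatter two halves of a vector to two (emulated) FPGA ranks, then gather.
  net.send(destRank = 0, payload = Array[Byte](1, 2, 3))
  net.send(destRank = 1, payload = Array[Byte](4, 5, 6))
  val gathered = (0 to 1).flatMap(rank => net.recv(rank))
  println(gathered.mkString(", "))
}
```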
IV. CONCLUSION
Supercomputers have been growing in recent years to occupy large areas and consume as much energy as small towns. This trend is impossible to sustain and highlights the major issues of the current approach based on CPUs and GPUs. Meanwhile, FPGA-based heterogeneous platforms have shown great improvements in performance and energy consumption when compared to their CPU or GPU counterparts. Nonetheless, adoption has remained low, primarily owing to the complexity of hardware design and the lack of standards for interconnection, structure, and program description, to mention a few factors that affect most development tools by forcing over-specification.
By studying the most relevant implementations of FPGA heterogeneous clusters, we propose three main domains in each cluster, namely, network, hardware, and software tools, that help recognize the contributions and challenges of each work. Furthermore, studying a specific cluster architecture under this division aids in identifying the origin of some issues and understanding the compromises of design decisions taken in the different domains. By understanding the trade-offs related to each decision, developers can better anticipate the critical issues in each domain and plan contingency measures in the most convenient manner. This survey sheds light on the open challenges that future clusters will have to overcome but also offers an overview of the already available and tested approaches.
FPGA-based heterogeneous computing is a challenging field, with enormous potential to change the dominant computing paradigm. In recent years, growing interest has brought important contributions to the development of tools and, more importantly, experimental platforms.
With standard platform descriptions and interfaces, an open collaborative development approach will allow the creation of communities to accelerate adoption. New technologies, such as SoC-FPGAs, will certainly be at the center of future cluster architectures, considering the advantages of having CPUs, GPUs, and FPGAs in the same device.

ACKNOWLEDGMENT
The authors would like to thank Romina Soledad Molina and Charn Loong Ng for their valuable insight in the process of writing this article.

REFERENCES
[20] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, ''Rodinia: A benchmark suite for heterogeneous computing,'' in Proc. IEEE Int. Symp. Workload Characterization (IISWC), Oct. 2009, pp. 44–54, doi: 10.1109/IISWC.2009.5306797.
[21] Virginia Tech Synergy. (2019). GitHub—VTSynergy/OpenDwarfs: A Benchmark Suite. [Online]. Available: https://github.com/vtsynergy/OpenDwarfs
[22] A. Putnam et al., ''A reconfigurable fabric for accelerating large-scale datacenter services,'' in Proc. ACM/IEEE 41st Int. Symp. Comput. Archit. (ISCA), Jun. 2014, pp. 13–24, doi: 10.1109/ISCA.2014.6853195.
[23] Alibaba. (2018). Deep Dive Into Alibaba Cloud F3 FPGA as a Service Instances—Alibaba Cloud Community. [Online]. Available: https://www.alibabacloud.com/blog/deep-dive-into-alibaba-cloud-f3-fpga-as-a-service-instances_594057
[24] Amazon. (2017). Amazon EC2 F1 Instances. [Online]. Available: https://aws.amazon.com/ec2/instance-types/f1/
[1] C. Maxfield. (Sep. 2011). Who Made the First PLD?—EETimes. [Online]. [25] R. Kiełbik, K. Hałagan, W. Zatorski, J. Jung, J. Ulański, A. Napieralski,
Available: https://www.eetimes.com/who-made-the-first-pld/ K. Rudnicki, P. Amrozik, G. Jabłoński, D. Stożek, P. Polanowski,
[2] (2017). Xilinx Co-Founder Ross Freeman Honored—EETimes. [Online]. Z. Mudza, J. Kupis, and P. Panek, ‘‘ARUZ—Large-scale, massively
Available: https://www.eetimes.com/xilinx-co-founder-ross-freeman- parallel FPGA-based analyzer of real complex systems,’’ Comput.
honored/ Phys. Commun., vol. 232, pp. 22–34, Nov. 2018. [Online]. Available:
[3] Xilinx. (2021). Vivado Design Suite User Guide, Version 2021.1. https://linkinghub.elsevier.com/retrieve/pii/S0010465518302182, doi:
[Online]. Available: https://www.xilinx.com/support/documentation/sw_ 10.1016/j.cpc.2018.06.010.
manuals/xilinx2021_1/ug973-vivado-release-notes-install-license.pdf [26] F. Fahim et al., ‘‘hls4ml: An open-source codesign workflow to
[4] Xilinx. (2021). Vitis Unified Software Platform User Guide, Version empower scientific low-power machine learning devices,’’ 2021,
2021.1. [Online]. Available: https://www.xilinx.com/support/document arXiv:2103.05579.
ation/sw_manuals/xilinx2021_1/ug1416-vitis-unified-platform.pdf [27] J. Villarreal, A. Park, W. Najjar, and R. Halstead, ‘‘Designing modu-
[5] Intel Corporation. (2021). Quartus Prime User Guide, Version lar hardware accelerators in C with ROCCC 2.0,’’ in Proc. 18th IEEE
21.1. [Online]. Available: https://www.intel.com/content/dam/www/ Annu. Int. Symp. Field-Program. Custom Comput. Mach., May 2010,
programmable/us/en/pdfs/literature/ug/ug-qps.pdf pp. 127–134, doi: 10.1109/FCCM.2010.28.
[6] Microsemi Libero. (2021). Libero SoC Design Suite User Guide, Version [28] R. Nane, V. Sima, B. Olivier, R. Meeuws, Y. Yankova, and K. Bertels,
12.0, 2021. [Online]. Available: https://www.microsemi.com/document- ‘‘DWARV 2.0: A CoSy-based C-to-VHDL hardware compiler,’’ in
portal/doc_view/131953-libero-soc-design-suite-v12-0-user-guide Proc. 22nd Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2012,
[7] (2021). Yosys Open SYnthesis Suite. Accessed: May 9, 2023. [Online]. pp. 619–622, doi: 10.1109/FPL.2012.6339221.
Available: https://github.com/YosysHQ/yosys [29] A. Papakonstantinou, K. Gururaj, J. A. Stratton, D. Chen, J. Cong,
and W.-M.-W. Hwu, ‘‘Efficient compilation of CUDA kernels for
[8] CHIPS Alliance. (2017). FOSS Flows for FPGA—F4PGA
high-performance computing on FPGAs,’’ ACM Trans. Embedded Com-
Documentation. [Online]. Available: https://f4pga.readthedocs.
put. Syst., vol. 13, no. 2, pp. 1–26, Sep. 2013, doi: 10.1145/2514641.
io/en/latest/index.html
2514652.
[9] Agile Analog. (2021). RapidSilicon: Accelerating Silicon Development.
[30] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski,
Accessed: May 9, 2023. [Online]. Available: https://www.agileanalog.
S. D. Brown, and J. H. Anderson, ‘‘LegUp: An open-source high-level
com/products/rapidsilicon
synthesis tool for FPGA-based processor/accelerator systems,’’ ACM
[10] W. A. Najjar and P. Ienne, ‘‘Reconfigurable computing,’’ IEEE Micro, Trans. Embedded Comput. Syst., vol. 13, no. 2, pp. 1–27, Sep. 2013.
vol. 34, no. 1, pp. 4–6, Jan. 2014. [Online]. Available: https://dl.acm.org/ [Online]. Available: https://doi-org.ezproxy.cern.ch/10.1145/2514740,
doi/10.1145/508352.508353, doi: 10.1109/MM.2014.25. doi: 10.1145/2514740.
[11] Altera Corporation. (Jul. 2014). What is an SoC FPGA? Architecture [31] S. Lee, J. Kim, and J. S. Vetter, ‘‘OpenACC to FPGA: A frame-
Brief. [Online]. Available: http://www.altera.com/socarchitecture work for directive-based high-performance reconfigurable computing,’’
[12] W. Vanderbauwhede et al., High-Performance Computing Using FPGAs. in Proc. IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), May 2016,
New York, NY, USA: Springer, 2013. pp. 544–554, doi: 10.1109/IPDPS.2016.28.
[13] M. Awad, ‘‘FPGA supercomputing platforms: A survey,’’ in Proc. [32] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee,
Int. Conf. Field Program. Log. Appl., Aug. 2009, pp. 564–568. J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and
[Online]. Available: http://ieeexplore.ieee.org/document/5272406/, doi: A. Agarwal, ‘‘Baring it all to software: Raw machines,’’ Computer,
10.1109/FPL.2009.5272406. vol. 30, no. 9, pp. 86–93, 1997. [Online]. Available: http://ieeexplore.
[14] K. O’Neal and P. Brisk, ‘‘Predictive modeling for CPU, GPU, and ieee.org/document/612254/, doi: 10.1109/2.612254.
FPGA performance and power consumption: A survey,’’ in Proc. IEEE [33] J. D. Davis, ‘‘FAST: A flexible architecture for simulation and testing
Comput. Soc. Annu. Symp. VLSI (ISVLSI), Jul. 2018, pp. 763–768, doi: of multiprocessor and CMP systems,’’ Dept. Elect. Eng., Stanford Univ.,
10.1109/ISVLSI.2018.00143. Stanford, CA, USA, Dec. 2006.
[15] P. Colella. (2004). Defining Software Requirements for Scientific [34] H. Kalte, M. Porrmann, and U. Rückert, ‘‘A prototyping platform for
Computing. DARPA HPCS. [Online]. Available: https://www.krellinst. dynamically reconfigurable system on chip designs,’’ in Proc. IEEE
org/doecsgf/conf/2013/pres/pcolella.pdf Workshop Heterogeneous Reconfigurable Syst. Chip (SoC), Hamburg,
[16] K. Asanovic et al., ‘‘The landscape of parallel computing research: Germany, Apr. 2002, pp. 57–75.
A view from Berkeley,’’ 2006. [Online]. Available: http://www.eecs. [35] M. Porrmann et al., ‘‘RAPTOR—A scalable platform for rapid prototyp-
berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html ing and FPGA-based cluster computing,’’ in Parallel Computing: From
[17] R. Inta, D. J. Bowman, and S. M. Scott, ‘‘The ‘Chimera’: An off-the-shelf Multicores and GPU’s to Petascale (Advances in Parallel Computing),
CPU/GPGPU/FPGA hybrid computing platform,’’ Int. J. Reconfigurable vol. 19. Amsterdam, The Netherlands: IOS Press, 2010, doi: 10.3233/978-
Comput., vol. 2012, pp. 1–10, Jan. 2012. [Online]. Available: http://www. 1-60750-530-3-592.
hindawi.com/journals/ijrc/2012/241439, doi: 10.1155/2012/241439. [36] C. Steffen and G. Genest, ‘‘Nallatech in-socket FPGA front-side bus
[18] R. D. Chamberlain, ‘‘Architecturally truly diverse systems: A accelerator,’’ Comput. Sci. Eng., vol. 12, no. 2, pp. 78–83, Mar. 2010, doi:
review,’’ Future Gener. Comput. Syst., vol. 110, pp. 33–44, 10.1109/MCSE.2010.45.
Sep. 2020. [Online]. Available: https://linkinghub.elsevier.com/retrieve/ [37] C. Pohl, C. Paiz, and M. Porrmann, ‘‘vMAGIC—Automatic code
pii/S0167739X19313184, doi: 10.1016/j.future.2020.03.061. generation for VHDL,’’ Int. J. Reconfigurable Comput., vol. 2009,
[19] R. Palmer. (2011). Parallel Dwarfs (Inaccessible). [Online]. Available: pp. 1–9, Jan. 2009. [Online]. Available: http://vmagic.sourceforge.net/,
http://paralleldwarfs.codeplex.com/ doi: 10.1155/2009/205149.
[38] S. Lyberis, G. Kalokerinos, M. Lygerakis, V. Papaefstathiou, [55] W. Kastl and T. Loimayr, ‘‘A parallel computing system with special-
I. Mavroidis, M. Katevenis, D. Pnevmatikatos, and D. S. Nikolopoulos, ized coprocessors for cryptanalytic algorithms,’’ in P170—Sicherheit
‘‘FPGA prototyping of emerging manycore architectures for parallel 2010—Sicherheit, Schutz und Zuverlässigkeit, F. C. Freiling, Ed. Bonn,
programming research using formic boards,’’ J. Syst. Archit., vol. 60, Germany: Gesellschaft für Informatik, 2010, pp. 78–83. [Online]. Avail-
no. 6, pp. 481–493, Jun. 2014. [Online]. Available: https://linkinghub. able: https://dl.gi.de/handle/20.500.12116/19801
elsevier.com/retrieve/pii/S138376211400054X, doi: 10.1016/j.sysarc. [56] B. Danczul, J. Fuß, S. Gradinger, B. Greslehner, W. Kastl, and F. Wex,
2014.03.002. ‘‘Cuteforce analyzer: A distributed bruteforce attack on PDF encryption
[39] S. Lyberis, G. Kalokerinos, M. Lygerakis, V. Papaefstathiou, with GPUs and FPGAs,’’ in Proc. Int. Conf. Availability, Rel. Secur.,
D. Tsaliagkos, M. Katevenis, D. Pnevmatikatos, and D. Nikolopoulos, Sep. 2013, pp. 720–725. [Online]. Available: http://ieeexplore.ieee.
‘‘Formic: Cost-efficient and scalable prototyping of manycore org/document/6657310/, doi: 10.1109/ARES.2013.94.
architectures,’’ in Proc. IEEE 20th Int. Symp. Field-Program. Custom [57] A. H. T. Tse, D. B. Thomas, K. H. Tsoi, and W. Luk,
Comput. Mach., Apr. 2012, pp. 61–64, doi: 10.1109/FCCM.2012.20. ‘‘Dynamic scheduling monte-carlo framework for multi-accelerator
[40] H. Shah et al., ‘‘Remote direct memory access (RDMA) protocol exten- heterogeneous clusters,’’ in Proc. Int. Conf. Field-Program. Technol.,
sions,’’ Tech. Rep. 7306, Jun. 2014. Dec. 2010, pp. 233–240. [Online]. Available: http://ieeexplore.ieee.
[41] V. Kale, ‘‘Using the MicroBlaze processor to accelerate cost-sensitive org/document/5681495/, doi: 10.1109/FPT.2010.5681495.
embedded system development,’’ Xilinx, Jun. 2016. [Online]. Available: [58] G. Tan, C. Zhang, W. Wang, and P. Zhang, ‘‘SuperDragon,’’ ACM
https://docs.xilinx.com/v/u/en-US/wp469-microblaze-for-cost-sensitive- Trans. Reconfigurable Technol. Syst., vol. 8, no. 4, pp. 1–22, Oct. 2015.
apps [Online]. Available: https://dl.acm.org/doi/10.1145/2740966, doi:
[42] S. G. Kavadias, M. G. H. Katevenis, M. Zampetakis, and 10.1145/2740966.
D. S. Nikolopoulos, ‘‘On-chip communication and synchronization [59] S. W. Moore, P. J. Fox, S. J. T. Marsh, A. T. Markettos, and A. Mujumdar,
mechanisms with cache-integrated network interfaces,’’ in Proc. 7th ‘‘Bluehive–A field-programable custom computing machine for extreme-
ACM Int. Conf. Comput. Frontiers, May 2010, pp. 217–226, doi: scale real-time neural network simulation,’’ in Proc. IEEE 20th Int.
10.1145/1787275.1787328. Symp. Field-Program. Custom Comput. Mach., Apr. 2012, pp. 133–140.
[43] Cadence. (2019). Palladium Emulation | Cadence. [Online]. Available: [Online]. Available: https://ieeexplore.ieee.org/document/6239804/, doi:
https://www.cadence.com/en_US/home/tools/system-design-and- 10.1109/FCCM.2012.32.
verification/emulation-and-prototyping/palladium.html [60] P. J. Fox, A. T. Markettos, and S. W. Moore, ‘‘Reliably prototyping
[44] Siemens. (2022). Veloce Prototyping—FPGA | Siemens Software. large SoCs using FPGA clusters,’’ in Proc. 9th Int. Symp. Reconfig-
[Online]. Available: https://eda.sw.siemens.com/en-US/ic/veloce/fpga- urable Commun.-Centric Syst.-on-Chip (ReCoSoC), May 2014, pp. 1–8.
prototyping/ [Online]. Available: http://ieeexplore.ieee.org/document/6861350/, doi:
[45] B. da Silva, A. Braeken, E. H. D’Hollander, A. Touhafi, J. G. Cornelis, 10.1109/ReCoSoC.2014.6861350.
and J. Lemeire, ‘‘Comparing and combining GPU and FPGA accelerators [61] A. Theodore Markettos, P. J. Fox, S. W. Moore, and A. W. Moore,
in an image processing context,’’ in Proc. 23rd Int. Conf. Field Program. ‘‘Interconnect for commodity FPGA clusters: Standardized or cus-
Log. Appl., Sep. 2013, pp. 1–4. [Online]. Available: http://ieeexplore. tomized?’’ in Proc. 24th Int. Conf. Field Program. Log. Appl. (FPL),
ieee.org/document/6645552/, doi: 10.1109/FPL.2013.6645552. Sep. 2014, pp. 1–8. http://ieeexplore.ieee.org/document/6927472/, doi:
[46] T. Otsuka, T. Aoki, E. Hosoya, and A. Onozawa, ‘‘An image recognition 10.1109/FPL.2014.6927472.
system for multiple video inputs over a multi-FPGA system,’’ in Proc. [62] R. S. Nikhil et al., BSV by Example, 10th ed. 2010. [Online]. Available:
IEEE 6th Int. Symp. Embedded Multicore SoCs, Sep. 2012, pp. 1–7. http://www.bluespec.com/support/
[Online]. Available: http://ieeexplore.ieee.org/document/6354671/, doi: [63] M. Baity-Jesi et al., ‘‘Janus II: A new generation application-driven com-
10.1109/MCSoC.2012.33. puter for spin-system simulations,’’ Comput. Phys. Commun., vol. 185,
[47] The RTN Collaboration, ‘‘64-transputer machine,’’ in Proc. CHEP, no. 2, pp. 550–559, Feb. 2014. [Online]. Available: https://linkinghub.
Geneva, Switzerland, 1992, pp. 353–360. elsevier.com/retrieve/pii/S0010465513003470, doi: 10.1016/j.cpc.2013.
[48] H. Schmit et al., ‘‘Behavioral synthesis for FPGA-based comput- 10.019.
ing,’’ in Proc. IEEE Workshop FPGA’s Custom Comput. Mach., 1994, [64] R. Kiełbik, K. Rudnicki, Z. Mudza, and J. Jung, ‘‘Methodology of
pp. 125–132, doi: 10.1109/FPGA.1994.315591. firmware development for ARUZ—An FPGA-based HPC system,’’ Elec-
[49] A. Cruz, J. Pech, A. Tarancón, P. Téllez, C. L. Ullod, and C. Ungil, tronics, vol. 9, no. 9, p. 1482, Sep. 2020. [Online]. Available: https://
‘‘SUE: A special purpose computer for spin glass models,’’ Com- www.mdpi.com/journal/electronics, doi: 10.3390/electronics9091482.
put. Phys. Commun., vol. 133, nos. 2–3, pp. 165–176, Jan. 2001, doi: [65] (2006). VHDL Preprocessor Home Page. [Online]. Available: https://
10.1016/S0010-4655(00)00170-3. sourceforge.net/projects/vhdlpp/
[50] F. Belletti, I. Campos, A. Maiorano, S. P. Gavir, D. Sciretti, A. Tarancon, [66] S. Karim, J. Harkin, L. McDaid, B. Gardiner, and J. Liu, ‘‘AstroByte:
J. L. Velasco, A. C. Flor, D. Navarro, P. Tellez, L. A. Fernandez, Multi-FPGA architecture for accelerated simulations of spiking astrocyte
V. Martin-Mayor, A. M. Sudupe, S. Jimenez, E. Marinari, F. Mantovani, neural networks,’’ in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE),
G. Poll, S. F. Schifano, L. Tripiccione, and J. J. Ruiz-Lorenzo, ‘‘Ianus: An Mar. 2020, pp. 1568–1573, doi: 10.23919/DATE48585.2020.9116312.
adaptive FPGA computer,’’ Comput. Sci. Eng., vol. 8, no. 1, pp. 41–49, [67] S. Yang, J. Wang, X. Hao, H. Li, X. Wei, B. Deng, and K. A. Loparo,
Jan. 2006. [Online]. Available: http://ieeexplore.ieee.org/document/ ‘‘BiCoSS: Toward large-scale cognition brain with multigranular neuro-
1563961/, doi: 10.1109/MCSE.2006.9. morphic architecture,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 33,
[51] F. Belletti et al., ‘‘Janus: An FPGA-based system for high- no. 7, pp. 2801–2815, Jul. 2022, doi: 10.1109/TNNLS.2020.3045492.
performance scientific computing,’’ Comput. Sci. Eng., vol. 11, no. 1, [68] D. Gratadour. (2021). Microgate—Green Flash. [Online]. Available:
pp. 48–58, Jan. 2009. [Online]. Available: https://ieeexplore.ieee.org/ http://green-flash.lesia.obspm.fr/microgate.html
document/4720223/, doi: 10.1109/MCSE.2009.11. [69] Y. Clénet et al. (2019). MICADO-MAORY SCAO Preliminary Design,
[52] M. Baity-Jesi et al., ‘‘An FPGA-based supercomputer for statistical Development Plan & Calibration Strategies. [Online]. Available:
physics: The weird case of Janus,’’ in High-Performance Computing https://hal.archives-ouvertes.fr/hal-03078430
Using FPGAs. New York, NY, USA: Springer, Mar. 2013, pp. 481–506. [70] A. Brown, D. Thomas, J. Reeve, G. Tarawneh, A. De Gennaro,
[Online]. Available: https://link-springer-com.ezproxy.cern.ch/chapter/ A. Mokhov, M. Naylor, and T. Kazmierski, ‘‘Distributed event-based
10.1007/978-1-4614-1791-0_16, doi: 10.1007/978-1-4614-1791-0_16. computing,’’ in Parallel Computing is Everywhere (Advances in Parallel
[53] S. Kumar, C. Paar, J. Pelzl, G. Pfeiffer, and M. Schimmler, ‘‘Breaking Computing), vol. 32. 2018, pp. 583–592. [Online]. Available: https://
ciphers with COPACOBANA—A cost-optimized parallel code breaker,’’ ebooks.iospress.nl/doi/10.3233/978-1-61499-843-3-583, doi: 10.3233/
in Proc. Int. Workshop Cryptograph. Hardw. Embedded Syst., in Lecture 978-1-61499-843-3-583.
Notes in Computer Science: Including Subseries Lecture Notes in Arti- [71] M. A. Petrovici, B. Vogginger, P. Müller, O. Breitwieser, M. Lundqvist,
ficial Intelligence and Lecture Notes in Bioinformatics, vol. 4249, 2006, L. Müller, M. Ehrlich, A. Destexhe, A. Lansner, R. Schüffny,
pp. 101–118, doi: 10.1007/11894063_9. J. Schemmel, and K. Meier, ‘‘Characterization and compensation of
[54] T. Güneysu, T. Kasper, M. Novotný, C. Paar, and A. Rupp, ‘‘Crypt- network-level anomalies in mixed-signal neuromorphic modeling plat-
analysis with COPACOBANA,’’ IEEE Trans. Comput., vol. 57, no. 11, forms,’’ PLoS ONE, vol. 9, no. 10, Oct. 2014, Art. no. e108590. [Online].
pp. 1498–1513, Nov. 2008. [Online]. Available: http://ieeexplore.ieee. Available: https://journals.plos.org/plosone/article?id=10.1371/journal.
org/document/4515858/, doi: 10.1109/TC.2008.80. pone.0108590, doi: 10.1371/journal.pone.0108590.
[72] I. Ohmura, G. Morimoto, Y. Ohno, A. Hasegawa, and M. Taiji, [88] R. Baxter, S. Booth, M. Bull, G. Cawood, J. Perry, M. Parsons,
‘‘MDGRAPE-4: A special-purpose computer system for molecular A. Simpson, A. Trew, A. McCormick, G. Smart, R. Smart, A. Cantle,
dynamics simulations,’’ Philos. Trans. Roy. Soc. A, Math., Phys. Eng. Sci., R. Chamberlain, and G. Genest, ‘‘Maxwell—A 64 FPGA supercom-
vol. 372, Aug. 2014, Art. no. 20130387. [Online]. Available: https://pmc/ puter,’’ in Proc. 2nd NASA/ESA Conf. Adapt. Hardw. Syst. (AHS),
articles/PMC4084528/ and https://pmc/articles/PMC4084528/?report= Aug. 2007, pp. 287–294. http://ieeexplore.ieee.org/document/4291933/,
abstract and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4084528/, doi: 10.1109/AHS.2007.71.
doi: 10.1098/RSTA.2013.0387. [89] R. Baxter, S. Booth, M. Bull, G. Cawood, J. Perry, M. Parsons,
[73] J. Weerasinghe, F. Abel, C. Hagleitner, and A. Herkersdorf, ‘‘Enabling A. Simpson, A. Trew, A. McCormick, G. Smart, R. Smart, A. Cantle,
FPGAs in hyperscale data centers,’’ in Proc. IEEE 12th Int. Conf. Ubiq- R. Chamberlain, and G. Genest, ‘‘The FPGA high-performance comput-
uitous Intell. Comput., IEEE 12th Int. Conf. Autonomic Trusted Comput. ing alliance parallel toolkit,’’ in Proc. 2nd NASA/ESA Conf. Adapt. Hardw.
IEEE 15th Int. Conf. Scalable Comput. Commun. Associated Workshops Syst. (AHS), Aug. 2007, pp. 301–307, doi: 10.1109/AHS.2007.104.
(UIC-ATC-ScalCom), Aug. 2015, pp. 1078–1086. [Online]. Available: [90] O. Mencer, K. H. Tsoi, S. Craimer, T. Todman, W. Luk, M. Y. Wong,
https://ieeexplore.ieee.org/document/7518378/, doi: 10.1109/UIC-ATC- and P. H. W. Leong, ‘‘Cube: A 512-FPGA cluster,’’ in Proc.
ScalCom-CBDCom-IoP.2015.199. 5th Southern Conf. Program. Log. (SPL), Apr. 2009, pp. 51–57.
[74] Xilinx. (2016). Xilinx and IBM to Enable FPGA-Based Acceleration [Online]. Available: http://ieeexplore.ieee.org/document/4914907/, doi:
Within SuperVessel OpenPOWER Development Cloud. [Online]. 10.1109/SPL.2009.4914907.
Available: https://www.xilinx.com/news/press/2016/xilinx-and-ibm-to- [91] M. Showerman, J. Enos, A. Pant, V. Kindratenko, C. Steffen,
enable-fpga-based-acceleration-within-supervessel-openpower- R. Pennington, and W.-M. Hwu, ‘‘QP: A heterogeneous multi-accelerator
development-cloud.html cluster,’’ in Proc. 10th LCI Int. Conf. High-Perform. Clustered Comput.,
[75] F. Abel, J. Weerasinghe, C. Hagleitner, B. Weiss, and S. Paredes, Boulder, CO, USA, Mar. 2009, pp. 1–8.
‘‘An FPGA platform for hyperscalers,’’ in Proc. IEEE 25th Annu. [92] Xilinx. (2013). ISE Design Suite. [Online]. Available:
Symp. High-Perform. Interconnects (HOTI), Aug. 2017, pp. 29–32. https://www.xilinx.com/products/design-tools/ise-design-suite.html
[Online]. Available: http://ieeexplore.ieee.org/document/8071053/, doi: [93] A. Pant, H. Jafri, and V. Kindratenko, ‘‘Phoenix: A runtime environment
10.1109/HOTI.2017.13. for high performance computing on chip multiprocessors,’’ in Proc.
[76] J. Weerasinghe, F. Abel, C. Hagleitner, and A. Herkersdorf, ‘‘Disag- 17th Euromicro Int. Conf. Parallel, Distrib. Netw.-Based Process., 2009,
gregated FPGAs: Network performance comparison against bare-metal pp. 119–126, doi: 10.1109/PDP.2009.41.
servers, virtual machines and Linux containers,’’ in Proc. Int. Conf.
[94] K. H. Tsoi and W. Luk, ‘‘Axel,’’ in Proc. 18th Annu. ACM/SIGDA Int.
Cloud Comput. Technol. Sci. (CloudCom), Dec. 2016, pp. 9–17, doi:
Symp. Field Program. Gate Arrays, New York, NY, USA, Feb. 2010,
10.1109/CLOUDCOM.2016.0018.
p. 115. http://portal.acm.org/citation.cfm?doid=1723112.1723134, doi:
[77] B. Ringlein, F. Abel, A. Ditter, B. Weiss, C. Hagleitner, and D. Fey, ‘‘Pro- 10.1145/1723112.1723134.
gramming reconfigurable heterogeneous computing clusters using MPI
[95] Adaptive Computing Enterprises. (2015). TORQUE Resource Man-
with transpilation,’’ in Proc. IEEE/ACM Int. Workshop Heterogeneous
ager Administrator Guide 4.2.10. [Online]. Available: http://www.
High-Perform. Reconfigurable Comput. (H2RC), Nov. 2020, pp. 1–9, doi:
adaptivecomputing.com
10.1109/H2RC51942.2020.00006.
[96] (2014). Maui Scheduler Administrator’s Guide. [Online]. Available:
[78] B. Ringlein, F. Abel, A. Ditter, B. Weiss, C. Hagleitner, and D. Fey,
http://docs.adaptivecomputing.com/maui/
‘‘ZRLMPI: A unified programming model for reconfigurable hetero-
geneous computing clusters,’’ in Proc. IEEE 28th Annu. Int. Symp. [97] A. George, H. Lam, and G. Stitt, ‘‘Novo-G: At the forefront of scal-
Field-Program. Custom Comput. Mach. (FCCM), May 2020, p. 220, doi: able reconfigurable supercomputing,’’ Comput. Sci. Eng., vol. 13, no. 1,
10.1109/FCCM48280.2020.00051. pp. 82–86, Jan. 2011. [Online]. Available: http://ieeexplore.ieee.org/
[79] H. Shahzad, A. Sanaullah, and M. Herbordt, ‘‘Survey and future document/5678570/, doi: 10.1109/MCSE.2011.11.
trends for FPGA cloud architectures,’’ in Proc. IEEE High Per- [98] A. D. George, M. C. Herbordt, H. Lam, A. G. Lawande, J. Sheng,
form. Extreme Comput. Conf. (HPEC), Sep. 2021, pp. 1–10, doi: and C. Yang, ‘‘Novo-G#: Large-scale reconfigurable computing with
10.1109/HPEC49654.2021.9622807. direct and programmable interconnects,’’ in Proc. IEEE High Per-
[80] C. Bobda et al., ‘‘The future of FPGA acceleration in datacenters and form. Extreme Comput. Conf. (HPEC), Sep. 2016, pp. 1–7. [Online].
the cloud,’’ ACM Trans. Reconfigurable Technol. Syst., vol. 15, no. 3, Available: http://ieeexplore.ieee.org/document/7761639/, doi: 10.1109/
Sep. 2022, Art. no. 34, doi: 10.1145/3506713. HPEC.2016.7761639.
[81] R. Sass, W. V. Kritikos, A. G. Schmidt, S. Beeravolu, and [99] Xilinx. (Oct. 2017). Interlaken 150G. [Online]. Available:
P. Beeraka, ‘‘Reconfigurable computing cluster (RCC) project: https://docs.xilinx.com/v/u/en-US/pg212-interlaken-150g
Investigating the feasibility of FPGA-based petascale computing,’’ [100] R. Giorgi, ‘‘AXIOM: A 64-bit reconfigurable hardware/software plat-
in Proc. 15th Annu. IEEE Symp. Field-Program. Custom Comput. form for scalable embedded computing,’’ in Proc. 6th Medit. Conf.
Mach. (FCCM), Apr. 2007, pp. 127–140. [Online]. Available: Embedded Comput. (MECO), Jun. 2017, pp. 1–4. http://ieeexplore.ieee.
https://ieeexplore.ieee.org/document/4297250, doi: 10.1109/FCCM. org/document/7977173/, doi: 10.1109/MECO.2017.7977117.
2007.62. [101] C. Álvarez et al., ‘‘The AXIOM software layers,’’ Microprocessors
[82] A. G. Schmidt, W. V. Kritikos, S. Datta, and R. Sass, ‘‘Reconfigurable Microsyst., vol. 47, pp. 262–277, Nov. 2016, doi: 10.1016/J.MICPRO.
computing cluster project: Phase I brief,’’ in Proc. 16th Int. Symp. 2016.07.002.
Field-Program. Custom Comput. Mach., Apr. 2008, pp. 300–301, doi: [102] R. Giorgi, M. Procaccini, and F. Khalili, ‘‘AXIOM: A scalable,
10.1109/FCCM.2008.12. efficient and reconfigurable embedded platform,’’ in Proc. Design,
[83] AMD Xilinx. (Oct. 2022). Aurora 64B/66B LogiCORE IP Prod- Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2019, pp. 480–485, doi:
uct Guide. [Online]. Available: https://docs.xilinx.com/r/en-US/pg074- 10.23919/DATE.2019.8715168.
aurora-64b66b [103] A. Filgueras, M. Vidal, M. Mateu, D. Jiménez-González, C. Alvarez,
[84] R. G. Jaganathan, K. D. Underwood, and R. Sass, ‘‘A configurable X. Martorell, E. Ayguadé, D. Theodoropoulos, D. Pnevmatikatos, P. Gai,
network protocol for cluster based communications using modular S. Garzarella, D. Oro, J. Hernando, N. Bettin, A. Pomella, M. Procaccini,
hardware primitives on an intelligent NIC,’’ in Proc. ACM/IEEE and R. Giorgi, ‘‘The AXIOM project: IoT on heterogeneous embedded
Conf. Supercomput., Nov. 2003, p. 22, doi: 10.1145/1048935. platforms,’’ IEEE Design Test, vol. 38, no. 5, pp. 74–81, Oct. 2021, doi:
1050173. 10.1109/MDAT.2019.2952335.
[85] HPC Open. (2022). Open MPI: Open Source High Performance Comput- [104] AMD-Xilinx. (2021). Xilinx Adaptive Compute Clusters (XACC)
ing. [Online]. Available: https://www.open-mpi.org/ Academia-Industry Research Ecosystem | HACC Resources. [Online].
[86] K. Datta and R. Sass, ‘‘RBoot: Software infrastructure for a remote Available: https://www.amd-haccs.io/adapt_2021.html
FPGA laboratory,’’ in Proc. 15th Annu. IEEE Symp. Field-Program. [105] (2016). Heterogeneous Accelerated Compute Clusters | HACC
Custom Comput. Mach. (FCCM ), Apr. 2007, pp. 343–344, doi: Resources. [Online]. Available: https://www.amd-haccs.io/index.html
10.1109/FCCM.2007.53. [106] T. Prickett. (2018). Forging a Hybrid CPU-FPGA Supercomputer.
[87] Staff. (Jul. 2005). FPGA High-Performance Computing Alliance [Online]. Available: https://www.nextplatform.com/2018/09/25/forging-
(FHPCA). [Online]. Available: http://www.fhpca.org a-hybrid-cpu-fpga-supercomputer/
[107] Paderborn Center for Parallel Computing (PC2). (2022). PC2— [124] D. A. Patterson, ‘‘RAMP: Research accelerator for multiple
Noctua 2 (Universität Paderborn). [Online]. Available: https://pc2.uni- processors—A community vision for a shared experimental
paderborn.de/hpc-services/available-systems/noctua2 parallel HW/SW platform,’’ in Proc. IEEE Int. Symp. Perform.
[108] Intel. (2022). OneAPI: A New Era of Heterogeneous Computing. Anal. Syst. Softw., Mar. 2006, p. 1, doi: 10.1109/ISPASS.2006.
[Online]. Available: https://www.intel.com/content/www/us/en/ 1620784.
developer/tools/oneapi/overview.html [125] Wirbel Loring. (May 2010). Berkeley Emulation Engine Update—EDN.
[109] D. Cock, A. Ramdas, D. Schwyn, M. Giardino, A. Turowski, Z. He, [Online]. Available: https://www.edn.com/berkeley-emulation-engine-
N. Hossle, D. Korolija, M. Licciardello, K. Martsenko, R. Achermann, update/
G. Alonso, and T. Roscoe, ‘‘Enzian: An open, general, CPU/FPGA [126] J. Rothman and C. Chang, ‘‘BEE technology overview,’’ in Proc. Int.
platform for systems software research,’’ in Proc. 27th ACM Int. Conf. Conf. Embedded Comput. Syst. (SAMOS). Samos, Greece: Institute
Architectural Support Program. Lang. Operating Syst., Feb. 2022, p. 18, of Electrical and Electronics Engineers, Jan. 2013, p. 277, doi:
doi: 10.1145/3503222.3507742. 10.1109/SAMOS.2012.6404186.
[110] A. D. Ioannou, K. Georgopoulos, P. Malakonakis, D. N. Pnevmatikatos, [127] EDN. (Jun. 2010). DESIGN TOOLS—BEEcube Launches BEE4, a Full-
V. D. Papaefstathiou, I. Papaefstathiou, and I. Mavroidis, ‘‘UNILOGIC: Speed FPGA Prototyping Platform—EDN. [Online]. Available: https://
A novel architecture for highly parallel reconfigurable systems,’’ ACM www.edn.com/design-tools-beecube-launches-bee4-a-full-speed-fpga-
Trans. Reconfigurable Technol. Syst., vol. 13, no. 4, pp. 1–32, Dec. 2020, prototyping-platform/
doi: 10.1145/3409115. [128] M. Lin, ‘‘Hardware-assisted large-scale neuroevolution for multiagent
[111] Cygnus Consortium. (2018). About Cygnus. [Online]. Available: https:// learning,’’ Dept. Elect. Comput. Eng., Univ. Central Florida, Orlando,
www.ccs.tsukuba.ac.jp/wp-content/uploads/sites/14/2018/12/About- FL, USA, Dec. 2014. [Online]. Available: https://apps.dtic.mil/sti/
Cygnus.pdf citations/ADA621804
[112] T. Boku, N. Fujita, R. Kobayashi, and O. Tatebe, ‘‘Cygnus—World first [129] I. Sokol. (Apr. 2015). NIs BEEcube Acquisition Drives 5G Communi-
multihybrid accelerated cluster with GPU and FPGA coupling,’’ in Proc. cations | Microwaves & RF. [Online]. Available: https://www.mwrf.
ICPP Workshops. New York, NY, USA: Association for Computing com/technologies/systems/article/21846169/nis-beecube-acquisition-
Machinery, Aug. 2022, p. 1, doi: 10.1145/3547276.3548629. drives-5g-communications
[113] K. Kikuchi, N. Fujita, R. Kobayashi, and T. Boku, ‘‘Implementation [130] National Instruments. (2022). What is FlexRIO?—NI. [Online].
and performance evaluation of collective communications using CIRCUS Available: https://www.ni.com/it-it/shop/electronic-test-instrumentation/
on multiple FPGAs,’’ in Proc. HPC Asia Workshops. New York, NY, flexrio/what-is-flexrio.html
USA: Association for Computing Machinery, Feb. 2023, p. 1523, doi: [131] L. Bonati, P. Johari, M. Polese, S. D’Oro, S. Mohanti,
10.1145/3581576.3581602. M. Tehrani-Moayyed, D. Villa, S. Shrivastava, C. Tassie, K. Yoder,
[114] RIKEN Center for Computational Science. (2020). Fugaku: Riken’s A. Bagga, P. Patel, V. Petkov, M. Seltser, F. Restuccia, A. Gosain,
Flagship Supercomputer. [Online]. Available: https://www.fugaku- K. R. Chowdhury, S. Basagni, and T. Melodia, ‘‘Colosseum: Large-
riken.jp/ scale wireless experimentation through hardware-in-the-loop network
[115] K. Sano, A. Koshiba, T. Miyajima, and T. Ueno, ‘‘ESSPER: Elastic emulation,’’ in Proc. IEEE Int. Symp. Dyn. Spectr. Access Netw.
and scalable FPGA-cluster system for high-performance reconfigurable (DySPAN), Dec. 2021, pp. 105–113, doi: 10.1109/DYSPAN53946.2021.
computing with supercomputer Fugaku,’’ in Proc. Int. Conf. High 9677430.
Perform. Comput. Asia–Pacific Region (HPC Asia). New York, NY, [132] Ettus. (2014). USRP Hardware Driver and USRP Manual: USRP
USA: Association for Computing Machinery, 2023, pp. 140–150, doi: X3x0 Series. [Online]. Available: https://files.ettus.com/manual/page_
10.1145/3578178.3579341. usrp_x3x0.html
[116] J. Davis et al., ‘‘BEE3: Revitalizing computer architecture research,’’ [133] NI. (2022). ATCA Overview—NI. [Online]. Available: https://www.
Microsoft, Apr. 2009. [Online]. Available: https://www.microsoft.com/ ni.com/docs/en-US/bundle/atca-3671-getting-started/page/overview.
en-us/research/publication/bee3-revitalizing-computer-architecture- html
research/ [134] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber,
[117] K. Kuusilinna, C. Chang, M. J. Ammer, B. C. Richards, and ‘‘HyperX: Topology, routing, and packaging of efficient large-scale net-
R. W. Brodersen, ‘‘Designing BEE: A hardware emulation engine for works,’’ in Proc. Conf. High Perform. Comput. Netw., Storage Anal.
signal processing in low-power wireless applications,’’ EURASIP J. Adv. (SC), New York, NY, USA, 2009, p. 1. [Online]. Available: http://
Signal Process., vol. 2003, no. 6, pp. 502–513, Dec. 2003. [Online]. dl.acm.org/citation.cfm?doid=1654059.1654101, doi: 10.1145/1654059.
Available: https://www.mathworks.com 1654101.
[118] S. C. Jain, S. Kumar, and A. Kumar, ‘‘Evaluation of various rout- [135] S. Gupta et al. (2022). Getting Started With RFNoC in UHD 4.0—
ing architectures for multi-FPGA boards,’’ in Proc. VLSI Design Ettus Knowledge Base. [Online]. Available: https://kb.ettus.com/Getting_
Wireless Digit. Imag. Millennium 13th Int. Conf. VLSI Design. Started_with_RFNoC_in_UHD_4.0
Washington, DC, USA: IEEE Computer Society, 2000, pp. 262–267. [136] A. Chaudhari and M. Braun, ‘‘A scalable FPGA architecture for flexible,
[Online]. Available: http://ieeexplore.ieee.org/document/812619/, doi: large-scale, real-time RF channel emulation,’’ in Proc. 13th Int. Symp.
10.1109/ICVD.2000.812619. Reconfigurable Commun.-Centric Syst.-on-Chip (ReCoSoC), Jul. 2018,
[119] C. Chang, K. Kuusilinna, B. Richards, and R. W. Brodersen, ‘‘Imple- pp. 1–8. [Online]. Available: https://ieeexplore.ieee.org/document/
mentation of BEE: A real-time large-scale hardware emulation engine,’’ 8449390/, doi: 10.1109/ReCoSoC.2018.8449390.
in Proc. ACM/SIGDA 11th Int. Symp. Field Program. Gate Arrays, [137] J. J. Dongarra and A. J. van der Steen, ‘‘High-performance computing
Feb. 2003, pp. 91–99, doi: 10.1145/611817.611832. systems: Status and outlook,’’ Acta Numerica, vol. 21, pp. 379–474,
[120] C. Chang, J. Wawrzynek, and R. W. Brodersen, ‘‘BEE2: A high-end May 2012, doi: 10.1017/S0962492912000050.
reconfigurable computing system,’’ IEEE Design Test Comput., vol. 22, [138] L. M. Al Qassem, T. Stouraitis, E. Damiani, and I. M. Elfadel,
no. 2, pp. 114–125, Feb. 2005, doi: 10.1109/MDT.2005.30. ‘‘FPGAaaS: A survey of infrastructures and systems,’’ IEEE Trans. Ser-
[121] A. G. Schmidt, B. Huang, R. Sass, and M. French, ‘‘Check- vices Comput., vol. 15, no. 2, pp. 1143–1156, Mar. 2022, doi: 10.1109/
point/restart and beyond: Resilient high performance computing with TSC.2020.2976012.
FPGAs,’’ in Proc. IEEE 19th Annu. Int. Symp. Field-Program. Cus- [139] A. George, H. Lam, A. Lawande, C. Pascoe, and G. Stitt, ‘‘Novo-
tom Comput. Mach., May 2011, pp. 162–169, doi: 10.1109/FCCM. G: A view at the HPC crossroads for scientific computing,’’ in Proc.
2011.22. ERSA, 2010, pp. 21–30. [Online]. Available: http://plaza.ufl.edu/poppyc/
[122] S. Buscemi and R. Sass, ‘‘Design and utilization of an FPGA cluster ERS5029.pdf
to implement a digital wireless channel emulator,’’ in Proc. 22nd Int. [140] D. Gratadour et al., ‘‘Prototyping AO RTC using emerging high
Conf. Field Program. Log. Appl. (FPL), Aug. 2012, pp. 635–638, doi: performance computing technologies with the green flash project,’’ Proc.
10.1109/FPL.2012.6339253. SPIE, vol. 10703, pp. 404–418, Jul. 2018. [Online]. Available: https://
[123] S. Buscemi and R. Sass, ‘‘Design of a scalable digital wireless chan- www.spiedigitallibrary.org/conference-proceedings-of-spie/10703/1070
nel emulator for networking radios,’’ in Proc. Mil. Commun. Conf., 318/Prototyping-AO-RTC-using-emerging-high-performance-computin
Nov. 2011, pp. 1858–1863. [Online]. Available: http://ieeexplore.ieee. g-technologies-with/10.1117/12.2312686.full%20, doi: 10.1117/12.
org/document/6127583/, doi: 10.1109/MILCOM.2011.6127583. 2312686.
[141] A. Mondigo, T. Ueno, K. Sano, and H. Takizawa, ‘‘Comparison [157] M. A. Zapletina and D. A. Zheleznikov, ‘‘The acceleration tech-
of direct and indirect networks for high-performance FPGA clus- niques for the modified pathfinder routing algorithm on an island-
ters,’’ in Applied Reconfigurable Computing. Architectures, Tools, and style FPGA,’’ in Proc. Conf. Russian Young Res. Electr. Electron.
Applications (Lecture Notes in Computer Science: Including Sub- Eng. (ElConRus), Jan. 2022, pp. 920–923. [Online]. Available: https://
series Lecture Notes in Artificial Intelligence and Lecture Notes in ieeexplore.ieee.org/document/9755536/, doi: 10.1109/ElConRus54750.
Bioinformatics), vol. 12083. Springer, 2020, pp. 314–329. [Online]. 2022.9755536.
Available: http://link.springer.com/10.1007/978-3-030-44534-8_24, doi: [158] A. Vaishnav, K. D. Pham, and D. Koch, ‘‘A survey on FPGA
10.1007/978-3-030-44534-8_24. virtualization,’’ in Proc. 28th Int. Conf. Field Program. Log. Appl.
[142] J. D. D. Gazzano, M. L. Crespo, A. Cicuttin, and F. R. Calle, (FPL). Piscataway, NJ, USA: Institute of Electrical and Electronics
Field-Programmable Gate Array (FPGA) Technologies for High Perfor- Engineers, Aug. 2018, pp. 131–138, doi: 10.1109/FPL.2018.
mance Instrumentation. Hershey, PA, USA: IGI Global, Jul. 2016, doi: 00031.
10.4018/978-1-5225-0299-9. [159] K. Fleming, H. Yang, M. Adler, and J. Emer, ‘‘The LEAP FPGA
[143] J. P. Orellana, M. B. Caminero, and C. Carrión, ‘‘Diseño de una arqui- operating system,’’ in Proc. 24th Int. Conf. Field Program.
tectura heterogénea para la gestión eficiente de recursos FPGA en un Log. Appl. (FPL), Sep. 2014, pp. 1–8, doi: 10.1109/FPL.2014.
cloud privado,’’ in Aplicaciones e Innovación de la Ingeniería en Cien- 6927488.
cia y Tecnología. Quito, Ecuador: Abya-Yala, 2019, pp. 165–199, doi: [160] L. Clausing and M. Platzner, ‘‘ReconOS64: A hardware oper-
10.7476/9789978104910.0007. ating system for modern platform FPGAs with 64-bit support,’’
[144] M. Southworth. (Oct. 2021). Choosing the best processor for the job. in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops
Curtis-Wright. [Online]. Available: https://www.curtisswrightds.com/ (IPDPSW), May 2022, pp. 120–127, doi: 10.1109/IPDPSW55747.2022.
sites/default/files/2021-10/Choosing-the-Best-Processor-for-the-Job- 00029.
white-paper.pdf [161] D. Korolija, T. Roscoe, and G. Alonso, ‘‘Do OS abstractions make
[145] M. Qasaimeh, K. Denolf, J. Lo, K. Vissers, J. Zambreno, and P. H. Jones, sense on FPGAs?’’ in Proc. 14th USENIX Symp. Operating Syst.
‘‘Comparing energy efficiency of CPU, GPU and FPGA implementations Design Implement., 2020, pp. 991–1010. [Online]. Available: https://
for vision kernels,’’ in Proc. IEEE Int. Conf. Embedded Softw. Syst. www.usenix.org/conference/osdi20/presentation/roscoe, doi: 10.5555/
(ICESS), Jun. 2019, pp. 1–8, doi: 10.1109/ICESS.2019.8782524. 3488766.3488822.
[146] A. Cicuttin, M. L. Crespo, K. S. Mannatunga, J. G. Samarawickrama, [162] S. Möller et al., ‘‘Community-driven development for computational
N. Abdallah, and P. B. Sabet, ‘‘HyperFPGA: A possible general purpose biology at sprints, hackathons and codefests,’’ BMC Bioinf., vol. 15,
reconfigurable hardware for custom supercomputing,’’ in Proc. Int. Conf. Dec. 2014, Art. no. S7, doi: 10.1186/1471-2105-15-S14-S7.
Adv. Electr., Electron. Syst. Eng. (ICAEES), Nov. 2016, pp. 21–26, doi: [163] M. Pathan et al., ‘‘A novel community driven software for func-
10.1109/ICAEES.2016.7888002. tional enrichment analysis of extracellular vesicles data,’’ J. Extra-
[147] A. Tomori and Y. Osana, ‘‘Kyokko: A vendor-independent high- cellular Vesicles, vol. 6, no. 1, Dec. 2017, Art. no. 1321455, doi:
speed serial communication controller,’’ in Proc. 11th Int. Symp. 10.1080/20013078.2017.1321455.
Highly Efficient Accel. Reconfigurable Technol. New York, NY, USA: [164] M. Kühbach, A. J. London, J. Wang, D. K. Schreiber, F. M. Martin,
Association for Computing Machinery, Jun. 2021, pp. 1–6. [Online]. I. Ghamarian, H. Bilal, and A. V. Ceguerra, ‘‘Community-driven
Available: https://doi-org.ezproxy.cern.ch/10.1145/3468044.3468051, methods for open and reproducible software tools for analyzing datasets
doi: 10.1145/3468044.3468051. from atom probe microscopy,’’ Microsc. Microanal., vol. 28, no. 4,
[148] T. Ueno and K. Sano, ‘‘VCSN: Virtual circuit-switching network for pp. 1038–1053, Aug. 2022. [Online]. Available: https://www.cambridge.
flexible and simple-to-operate communication in HPC FPGA cluster,’’ org/core/product/identifier/S1431927621012241/type/journal_article,
ACM Trans. Reconfigurable Technol. Syst., vol. 16, no. 2, pp. 1–32, doi: 10.1017/S1431927621012241.
Jun. 2023, doi: 10.1145/3579848. [165] R. D. Chamberlain, M. A. Franklin, E. J. Tyson, J. H. Buckley, J. Buh-
[149] T. El-Ghazawi et al., ‘‘Exploration of a research roadmap for ler, G. Galloway, S. Gayen, M. Hall, E. F. B. Shands, and N. Singla,
application development and execution on field-programmable gate ‘‘Auto-pipe: Streaming applications on architecturally diverse systems,’’
array (FPGA)-based systems,’’ George Washington Univ., Washington, Computer, vol. 43, no. 3, pp. 42–49, Mar. 2010, doi: 10.1109/MC.
DC, USA, Tech. Rep. ADA494473, Oct. 2008. [Online]. Available: 2010.62.
https://apps.dtic.mil/sti/citations/ADA494473 [166] Y. Osana, T. Imahigashi, and A. Tomori, ‘‘OpenFC: A portable toolkit
[150] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, for custom FPGA accelerators and clusters,’’ in Proc. 8th Int. Symp.
J. Wawrzynek, and K. Asanovic, ‘‘Chisel: Constructing hardware in Comput. Netw. Workshops (CANDARW), Nov. 2020, pp. 185–190, doi:
a scala embedded language,’’ in Proc. Design Autom. Conf., 2012, 10.1109/CANDARW51189.2020.00045.
pp. 1216–1225, doi: 10.1145/2228360.2228584. [167] G. Chirkov and D. Wentzlaff, ‘‘SMAPPIC: Scalable multi-FPGA archi-
[151] A. Izraelevitz, J. Koenig, P. Li, R. Lin, A. Wang, A. Magyar, D. Kim, tecture prototype platform in the cloud,’’ in Proc. 28th ACM Int. Conf.
C. Schmidt, C. Markley, J. Lawson, and J. Bachrach, ‘‘Reusabil- Architectural Support Program. Lang. Operating Syst. New York, NY,
ity is FIRRTL ground: Hardware construction languages, compiler USA: Association for Computing Machinery, Jan. 2023, pp. 733–746,
frameworks, and transformations,’’ in IEEE/ACM Int. Conf. Comput.- doi: 10.1145/3575693.3575753.
Aided Design Dig. Tech. Papers. Piscataway, NJ, USA: Institute of
Electrical and Electronics Engineers, Nov. 2017, pp. 209–216, doi:
10.1109/ICCAD.2017.8203780.
[152] C. Baaij, ‘‘CλasH: From Haskell to hardware,’’ Fac. EEMCS. Com-
put. Archit. Embedded Syst., Univ. Twente, Enschede, The Netherlands,
Dec. 2009.
[153] M. Kooijman, ‘‘Haskell as a higher order structural hardware descrip-
tion language,’’ Fac. EEMCS, Comput. Archit. Embedded Syst., Univ.
Twente, Enschede, The Netherlands, Dec. 2009. [Online]. Available:
http://essay.utwente.nl/59381/
[154] M. Mariotti, D. Magalotti, D. Spiga, and L. Storchi, ''The bondmachine, a moldable computer architecture,'' Parallel Comput., vol. 109, Mar. 2022, Art. no. 102873, doi: 10.1016/J.PARCO.2021.102873.
[155] R. S. Molina, V. Gil-Costa, M. L. Crespo, and G. Ramponi, ''High-level synthesis hardware design for FPGA-based accelerators: Models, methodologies, and frameworks,'' IEEE Access, vol. 10, pp. 90429–90455, 2022, doi: 10.1109/ACCESS.2022.3201107.
[156] Y. Zhou, D. Vercruyce, and D. Stroobandt, ''Accelerating FPGA routing through algorithmic enhancements and connection-aware parallelization,'' ACM Trans. Reconfigurable Technol. Syst., vol. 13, no. 4, pp. 1–26, Dec. 2020, doi: 10.1145/3406959.

WERNER FLORIAN SAMAYOA received the B.S. degree in electronics engineering from the University of San Carlos, Guatemala, in 2018. He is currently pursuing the Ph.D. degree in industrial and information engineering with the Multidisciplinary Laboratory (MLab), The Abdus Salam International Center for Theoretical Physics, Università degli Studi di Trieste, under the Joint-Supervision Program. His research interest includes scalable reconfigurable supercomputing.
MARIA LIZ CRESPO is currently a Research Officer with The Abdus Salam International Centre for Theoretical Physics (ICTP) and an Associate Researcher with the Italian National Institute of Nuclear Physics (INFN), Trieste, Italy. She is also coordinating the Research and Training Program, Multidisciplinary Laboratory (MLab), ICTP. She has organized several international schools and workshops on fully programmable systems on chip for nuclear and scientific instrumentation. She is the coauthor of more than 100 scientific publications in prestigious peer-reviewed journals. Her research interests include advanced scientific instrumentation for particle physics experiments and experimental multidisciplinary research.

SERGIO CARRATO received the master's degree in electronic engineering and the Ph.D. degree in signal processing from the University of Trieste, Trieste, Italy. Then, he was with Ansaldo Componenti and Sincrotrone Trieste in the field of electronic instrumentation for applied physics. He joined the Department of Electronics, University of Trieste, where he is currently an Associate Professor in electronic devices.
Open Access funding provided by ‘Università degli Studi di Trieste’ within the CRUI CARE Agreement