
Journal of Systems Architecture 129 (2022) 102561

Contents lists available at ScienceDirect

Journal of Systems Architecture


journal homepage: www.elsevier.com/locate/sysarc

A survey on hardware accelerators: Taxonomy, trends, challenges, and perspectives

Biagio Peccerillo a,∗, Mirco Mannino a, Andrea Mondelli b, Sandro Bartolini a

a Department of Information Engineering and Mathematics, University of Siena, Italy
b Huawei Technologies Research & Development (UK) Ltd, Cambridge, UK

ARTICLE INFO

Keywords: Accelerators; Domain-Specific Architectures; Survey; Taxonomy; Classification; Data-parallel; Machine Learning; PIM; CGRA; Open challenges; Future research directions

ABSTRACT

In recent years, the limits of the multicore approach emerged in the so-called "dark silicon" issue and diminishing returns of an ever-increasing core count. Hardware manufacturers, out of necessity, switched their focus to accelerators, a new paradigm that pursues specialization and heterogeneity over generality and homogeneity. They are special-purpose hardware structures separated from the CPU with aspects that exhibit a high degree of variability. We define a taxonomy based on fourteen of these aspects, grouped in four macro-categories: general aspects, host coupling, architecture, and software aspects. According to it, we categorize around 100 accelerators of the last decade from both industry and academia, and critically analyze emerging trends. We complete our discussion with throughput and efficiency figures. Then, we discuss some prominent open challenges that accelerators are facing, analyzing state-of-the-art solutions, and suggesting prospective research directions for the future.

1. Introduction
Around fifteen years ago, hardware manufacturers had to abandon the path set by Dennard scaling thirty years before due to growing leakage currents, power dissipation, and wire delays [1,2]. Further improvements to the single-core architecture became increasingly harder to achieve, to the point that they were forced to turn their attention to new design paradigms.

The first broadly adopted solution was the so-called "multicore", which involves the coexistence of multiple cores on the same die [3]. Switching to a multicore design helped computer architects to continue taking advantage of transistor shrinking and their ever-increasing count per unit area, but the limits of this solution manifested soon:

• the difficulty in dissipating thermal power made the use of all the available cores together at full speed inconvenient [4,5];
• the diminishing returns of an increased core count depicted by Amdahl's law [6] set an upper limit to the number of cores that can be fruitfully employed in most practical applications (a worked example follows this list).
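To make the second point concrete, assume a parallel fraction p = 0.95 (the 5% serial residue is a hypothetical figure chosen here for illustration, not taken from [6]). Amdahl's law then bounds the speedup attainable on N cores as

\[ S(N) = \frac{1}{(1-p) + p/N}, \qquad S(16) \approx 9.1, \quad S(64) \approx 15.4, \quad \lim_{N \to \infty} S(N) = \frac{1}{1-p} = 20, \]

so beyond a few tens of cores each additional core contributes almost nothing to the achievable speedup.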
These factors together made clear that the multicore approach alone is not sufficient to satisfy the worldwide growing demand for computing power and performance.

A similar scenario emerged in HPC, in embedded and, in general, in energy-constrained contexts, due to the difficulty to keep on improving performance per Watt metrics of general-purpose processors [7]: the support of a wide range of workloads requires significant investment of micro-architecture/die-area/cache/energy and, due to the interaction between the effects of the end of Dennard scaling and wire-delay issues, decreases the potential computational density of general-purpose chips [4]. The limit becomes operations per die, and due to the unavoidable power-envelope constraints, it translates into limited performance per Watt.

A possible way to overcome these limits is to relax the general-purpose trait, so as to leverage the possibility to customize the hardware to perform specific tasks. For instance, this happened in the GPU domain, which steadily grew in the last decade: some non-functional structures (e.g., big caches) are reduced and the number of functional ones, directly involved in computations, are increased accordingly. This enabled an on-chip HW parallelism that nowadays surpasses that of general-purpose processors by as much as 2 orders of magnitude (e.g., thousands of cores in high-end GPUs vs few tens of cores).

∗ Corresponding author.
E-mail address: [email protected] (B. Peccerillo).

https://doi.org/10.1016/j.sysarc.2022.102561
Received 9 April 2022; Accepted 12 May 2022
Available online 24 May 2022
1383-7621/© 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Fig. 1. Overview of the survey.

Today, many hardware manufacturers are following this path and adopting these design principles, investing in what are called Domain-Specific Architectures (DSAs) or accelerators [8].

1.1. Accelerators: characteristics and advantages

An accelerator can be defined as "a separate architectural substructure [. . . ] that is architected using a different set of objectives than the base processor, where these objectives are derived from the needs of a special class of applications" [9]. It is not meant as a replacement for the base processor: a CPU in the system is still necessary to execute O.S. tasks and orchestrate execution on the accelerator itself. In principle, the CPU is sufficient to execute also the workloads demanded to the accelerator, generally called the "accelerated workloads" [10]. The reason why the latter is preferred lies in the promised improvement of non-functional requirements such as compute throughput and/or low latency and/or energy efficiency. This, above all, is the main reason why accelerators are such a hot topic today.

Accelerators are not new in the computing landscape. Historically, they have been used for various tasks, but their adoption remained limited. Even the most promising accelerator became obsolete in a few years and was surpassed by the growing performance of general-purpose processors, thus making the investment "not worth it" [9,11]. This is no longer true: on the contrary, they are regarded as the most promising driving force of computer architecture [4], the only viable solution to satisfy the expanding computing needs of both private users and companies. Summarizing the words by Hennessy and Patterson [4] and Dally et al. [12], the real value of accelerators lies in:

• More efficient and specialized forms of parallelism;
• More effective use of the memory hierarchy, with high-bandwidth, low-cost, low-energy local/optimized memories;
• Data specialization, to do in one cycle what would need several cycles on a CPU;
• Reduced fetch and decode overhead, due to hardware specialization;
• Support for Domain Specific Languages (DSLs), to leverage the underlying architecture more efficiently.

Since accelerators are specialized hardware, finely tuned to the specific needs of a class of applications, they are not bound to the design constraints that are a property of general-purpose hardware. Computer architects may explore countless techniques that would not pay off in the general case, but could fit the special case, providing measurable advantages.

For instance, it is well-known that CPUs evolved to leverage an ever-increasing instruction-level parallelism (ILP), resorting to the replication of execution units (superscalar paradigm) and pipelines with a growing number of stages. With them, area- and power-hungry features such as out-of-order execution and complex branch prediction became fundamental to minimize performance penalties associated with pipeline stalls and flushes [3]. Not to mention the enormous amount of caching structures that nowadays general-purpose processors employ, trying to exploit locality and keep the cores steadily crunching instructions in the average case. On the other hand, GPUs, the most widespread accelerators in the market, aim at compute-intensive workloads where many arithmetic instructions are performed on big sets of data, with high degrees of data-parallelism. Under the load of these workloads, the die-area and complexity investments in sophisticated mechanisms like out-of-order execution and branch prediction pay off far less than dedicating such area to additional units directly involved in the computations. Moreover, such a simpler design drastically reduces the energy per operation compared to general-purpose architectures [13].

1.2. Scope of this work

Thanks to their special-purpose nature, accelerators can be designed more freely than general-purpose processors. By relaxing the generality trait, original, even exotic design choices can be exploited as long as they prove advantageous in the reduced domain for which the accelerator is intended. These choices are highly dependent on the characteristics of the accelerated work, its algorithmic features, the required form factor, etc. This variety affects not only architecture, but also the software ecosystems of languages, compilers, libraries, and frameworks that revolve around accelerators.

Due to this intrinsic heterogeneity, it is not easy to grasp the plethora of aspects that characterize accelerators: the domain in which they operate, their hardware components, whether they are programmable or not, the granularity at which they operate, the characteristics and level of maturity of their software ecosystem, the way they are connected to the host system, their relation with the host O.S., etc. These aspects are often intertwined, to the point that a comprehensive analysis of accelerators should discuss them together, from a holistic point of view.

This survey discusses accelerators in such spirit, considering several aspects that characterize them. First, we provide a multi-level taxonomy, organized in four macro-categories: general aspects, host coupling, architecture, and software aspects, and according to it we classify numerous accelerators from the last decade. The value of this classification is twofold: on the one hand, we help the readers to get synthetic information about each accelerator quickly and position it in this multi-dimensional design space; on the other hand, we give them a comprehensive overview of the accelerator world as a whole and highlight emerging trends. We enrich the discussion with throughput and efficiency figures for data-parallel and ML-oriented accelerators, and then discuss our findings in relation to the classification done. Finally, we discuss some open challenges that accelerators are facing, analyze some proposed state-of-the-art solutions, and suggest promising research directions as future perspectives.


Fig. 2. A visual representation of the proposed taxonomy, with the classification criteria grouped in four macro-categories.

In summary, the main contributions of this work are:

• A complete, multi-level taxonomy of accelerators organized in four macro-categories: general aspects, host coupling, architecture, and software aspects.
• A categorization of several accelerators from industry and academia and a comprehensive discussion of emerging trends.
• A comparison of throughput and efficiency figures of many data-parallel and ML-oriented accelerator models.
• An in-depth analysis of open challenges faced by accelerators as a whole, state-of-the-art solutions to them, and prospective directions for future research.

The remainder of this paper is organized as follows: in Section 2, we present our taxonomy; in Section 3, we use it to do a thorough classification of several accelerators and discuss the trends that emerge from it. In Section 4, we discuss throughput and efficiency figures. In Section 5, we discuss challenges, state-of-the-art solutions, and prospective research directions. In Section 6, we present some studies that share our view and focus on accelerators in general. We conclude in Section 7. Table 1 lists acronyms and abbreviations used throughout the paper. Fig. 1 provides an overview of the paper.

2. Taxonomy presentation

In order to describe accelerators comprehensively, we propose fourteen criteria and group them into four macro-categories. Fig. 2 gives a graphic representation of them.

In this section, we thoroughly discuss the classification criteria and present the possible values for each of them. They are descriptive of the variety of accelerators we have found and categorized, but they are not intended as a complete coverage of the design space. Unavoidably, some possibilities are not listed because they are not representative of the accelerators we took into consideration. Others have been omitted on purpose because they are too specific to a single model and would lead to a fragmentation of the relevant information. In the following, we acknowledge some exclusions for the sake of completeness.

Fig. 3. The General Aspects macro-category.

2.1. General aspects

This macro-category, shown in Fig. 3, includes aspects that describe an accelerator with a very high level of abstraction and permit to quickly contextualize it.

2.1.1. Accelerator year

The accelerator's launch year, where available, otherwise its announcement year. For academic projects, it is usually the year of publication of the first paper describing the accelerator. In this work, we restrict the scope to the last decade: 2013–2022.


Table 1
Acronyms and abbreviations used throughout the paper.
AI: Artificial Intelligence; ADC: Analog-to-Digital Converter; ANN: Artificial Neural Network; APU: Application Processing Unit; AR: Augmented Reality; ASIC: Application-Specific Integrated Circuit; BNN: Binarized Neural Network; CCIX: Cache Coherent Interconnect for Accelerators; CGRA: Coarse-Grained Reconfigurable Architecture; CNN: Convolutional Neural Network; CXL: Compute Express Link; DAC: Digital-to-Analog Converter; DDK: Device Development Kit; DDR: Double Data Rate; DFG: Data-Flow Graph; DIMM: Dual Inline Memory Module; DLRM: Deep Learning Recommendation Model; DNN: Deep Neural Network; DSA: Domain-Specific Architecture; DSL: Domain-Specific Language; DSP: Digital Signal Processing;
eDRAM: Embedded DRAM; FLOPS: Floating-Point Operations per Second; FPGA: Field-Programmable Gate Array; FSM: Finite-State Machine; GAN: Generative Adversarial Network; GEMM: GEneral Matrix–Matrix multiplication; GPGPU: General Purpose computing on GPU; GPU: Graphics Processing Unit; HBM: High-Bandwidth Memory; HDL: Hardware Description Language; HLS: High-Level Synthesis; HMC: Hybrid Memory Cube; HPC: High-Performance Computing; IDE: Integrated Development Environment; IoT: Internet of Things; IOMMU: I/O Memory Management Unit; IOTLB: I/O Translation Lookaside Buffer; JIT: Just-In-Time; k-NN: k-Nearest Neighbors; LBM: Lattice-Boltzmann Method; LLC: Last-Level Cache; LPDDR: Low-Power Double Data Rate; LSTM: Long Short-Term Memory;
MAC: Multiply-Accumulate; MCDRAM: Multi-Channel DRAM; MIMD: Multiple Instruction Multiple Data; ML: Machine Learning; MLP: Multi-Layer Perceptron; MMIO: Memory-Mapped I/O; NFA: Nondeterministic Finite Automata; NIC: Network Interface Controller; NNP: Neural Network Processor; NoC: Network-on-Chip; NPU: Neural Processing Unit; NRE: Non-Recurring Engineering; OAM: Open Accelerator Module; OCP: Open Compute Project; OPS: Operations per Second; PE: Processing Engine; PIM: Processing-In-Memory; PoD: Point of Delivery; RNN: Recurrent Neural Network; RTL: Register Transfer Level; SFU: Special Function Unit; SIMD: Single Instruction Multiple Data; SPE: Synergistic Processing Element; STT-MRAM: Spin-Transfer Torque Magnetic RAM; SVM: Support Vector Machine; TDP: Thermal Design Power; TLB: Translation Lookaside Buffer; TPU: Tensor Processing Unit; UVM: Unified Virtual Memory; VLIW: Very-Long Instruction Word; VM: Virtual Machine; WGA: Whole Genome Alignment.

2.1.2. Accelerator type

It synthetically specifies the accelerator's type, intended as its broadest classification from an architectural point of view. We identify the following types:

ASIC: chip designed to perform a specific application. As such, ASICs offer the most efficient solution, but it comes at the cost of no reconfigurability, no programmability, and high NRE costs.

FPGA: reconfigurable accelerators with a large number of logic blocks (up to millions [14]), memory cells, and specialized blocks (e.g., DSP, I/O, etc.) interconnected. Both blocks and interconnect allow a fine-grain reconfigurability that supports a high variety of applications, making them suitable for general-purpose computing [15,16]. Reconfiguration time is generally high, in the order of ms-s [17,18], as it involves the writing of configuration data into a non-volatile memory (generally an off-chip Flash memory), which is later used to feed on-chip memory cells (generally SRAM cells) when the FPGA is powered on [17,19].

Spatial: spatial accelerators feature an array of relatively simple ALU-like PEs that can communicate directly through NoCs, buses or ad-hoc inter-PE connections [20]. These PEs can form a chain and exchange data with each other in a dataflow fashion. Each PE can have its own control logic and memory [21].

CGRA: special case of spatial accelerators, organized as arrays of reconfigurable PEs connected through a reconfigurable interconnect. They offer a coarse-grain reconfigurability, which makes them domain-specific flexible. Their reconfiguration time is generally faster than FPGAs', usually in the order of ns-μs, making them eligible for temporal computations as well as spatial computations. They support both configuration-driven and dataflow-driven execution [18].

Systolic: in systolic accelerators, the main computation is performed by a 1D or 2D array of arithmetic units controlled in lockstep. Input data enter the array in the first level, each arithmetic unit applies a fixed function to its input and passes its output to arithmetic units in the following level, considered downstream. Each level produces a partial result which flows through the array in a wave-like fashion until the last level, which produces the total result as output [7]. Systolic accelerators are also a subset of spatial ones, with two main restrictions: ALUs are not programmable, but perform a fixed function on the input, and data flow in a fixed direction during elaboration (i.e., downstream). A minimal sketch of this wave-like accumulation is given after this type list.

Vector: accelerators where the computation is performed by vector processors, each with one or more vector cores. Each vector core has vector registers that can hold tens or hundreds of primitive elements (e.g., 32-bit floating point values) and apply single instructions to all the elements in the vector register. They rely heavily on pipelining and have the ability to load/store from/to the memory both contiguous and strided data [7].

Manycore: accelerators featuring many hundreds or even thousands of independent physical cores, sometimes grouped in tens of multi-processors. The manycore paradigm can be regarded as the natural evolution of the multi-core paradigm, with an even higher degree of parallelism that is particularly suited for data-parallel workloads. For the cores, a simpler design is preferred (in-order, no branch prediction, etc.) to pack them more densely in the chip area and leverage as much purely computational resources as possible.


GPU: GPUs are the most notable instances of manycore accelerators, but they also have similarities with vector accelerators. For their importance in today's market and for their specific architectural features, we employ a distinct category for them within the manycore umbrella. They feature up to tens of multi-processors made up of simple, in-order cores that exploit massive thread-level parallelism [22]. These are similar to independent MIMD cores, with each multi-processor acting as a multi-threaded vector core. The main difference with classic vector cores lies, in fact, in the massive employment of multi-threading rather than deep pipelines, with threads mapped onto physical lanes [7].

PIM: this type of accelerator brings computation closer to memory, in order to reduce the energy needed to move data and increase memory bandwidth. The term refers to accelerators providing both near-memory and in-memory computing. In the former case, compute engines are placed near to classic DRAM or SRAM cells, which retain their role unaltered. In the latter, memory cells are re-purposed to perform computing logic on stored data, also with the help of emerging memory technologies such as ReRAM and STT-MRAM that expose desirable characteristics in this sense [23].
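As a concrete illustration of the systolic principle described above, the following minimal C++ sketch (illustrative only, not the design of any accelerator surveyed here) emulates, cycle by cycle, a 1D systolic chain of multiply-accumulate units computing a matrix-vector product: each PE applies the same fixed MAC function in lockstep and passes its partial result downstream, so one finished output leaves the last level per cycle once the wave fills the array.

```cpp
#include <array>
#include <cstdio>

// Cycle-by-cycle emulation of a 1D systolic MAC chain (illustrative sketch).
// PE i keeps x[i] stationary; rows of W stream through with a one-cycle skew;
// partial sums travel downstream and the last PE emits y[r] = sum_i W[r][i]*x[i].
constexpr int N = 4;   // number of PEs (= vector length)
constexpr int R = 3;   // number of output rows

int main() {
    const std::array<float, N> x = {1.f, 2.f, 3.f, 4.f};
    const float W[R][N] = {{1, 1, 1, 1}, {1, 2, 3, 4}, {0, 1, 0, 1}};

    std::array<float, N> pipe{};   // partial sum currently latched in each PE
    std::array<float, R> y{};

    for (int cycle = 0; cycle < R + N - 1; ++cycle) {
        // Update from the last PE backwards so each PE reads the value its
        // upstream neighbour produced in the *previous* cycle (registered flow).
        for (int i = N - 1; i >= 0; --i) {
            const int r = cycle - i;          // row currently inside PE i
            if (r < 0 || r >= R) continue;
            const float upstream = (i == 0) ? 0.f : pipe[i - 1];
            pipe[i] = upstream + x[i] * W[r][i];
            if (i == N - 1) y[r] = pipe[i];   // finished result leaves the array
        }
    }
    for (int r = 0; r < R; ++r)
        std::printf("y[%d] = %.1f\n", r, y[r]);   // expected: 10.0, 30.0, 6.0
    return 0;
}
```

The same skewed schedule, extended to two dimensions, is what makes systolic arrays a natural match for the matrix-multiplication-heavy workloads discussed later in Section 3.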
A note is needed for the class of Spatial Accelerators. Although it is a super-set of FPGAs, CGRAs, and Systolic Accelerators, in this paper we label as Spatial Accelerators only those that cannot be described as any of the three. This way, we always prefer the most specific definition available, using FPGA, CGRA, or Systolic where fitting and reserving Spatial only for those accelerators with interconnected non-reconfigurable PEs with flexible, often programmable data flow direction. We adopt a categorization similar to that expressed in [24]. We use the same approach for Manycore and GPUs: although the latter is a special case of the former, we use the GPU term where appropriate and Manycore for accelerators that are not GPUs.

2.1.3. Accelerator origin

Whether the accelerator originates from Industry or Academia. Industrial accelerators are commercial products. They can be made available on the market to end-users by being sold individually or as computing resources in data-centers. Academic accelerators, conversely, are usually prototypes described in scientific journals and conference proceedings.

2.1.4. Accelerator domain

It refers to the class of applications that can be executed on the accelerator. Thus, by specifying that an accelerator operates in/pertains to a particular domain, we intend that said accelerator can execute (or accelerate) applications pertaining to that domain. We identify eighteen domains. These are not mutually exclusive, as there are overlaps. In this sense, "Data-parallel", "Dataflow", and "General-purpose" are broad domains, and the others are narrow domains, as highlighted in Table 5. The former are computation paradigms, and as such they are more ample. They may include several applications from the latter, which can also be described as application areas. Consequently, accelerators capable of targeting broad domains present a high flexibility, and are certainly able to execute workloads from one or more narrow ones. For such accelerators, we specify the broad domain to which they belong, and also some narrow ones that include the most notable applications executed on that accelerator.

Broad domains, or computation paradigms:

Data-parallel: computation paradigm in which problems can be expressed in terms of collections of elements that can be processed concurrently, by applying the same stream of instructions to each data element [25].

Dataflow: computation paradigm in which problems are expressed in terms of processing nodes and data flowing between them, usually by means of a DataFlow Graph (DFG) [26–28]. The operations are applied to the operands as soon as they are available on the processing node.

General-purpose: execution of a program with no particular structure, constraints, or application domain.

Narrow domains, or application areas:

ML training: execution of a machine learning model with the aim of establishing the value of the model parameters [21].

ML inference: execution of a machine learning model whose parameters are already established in a previous training phase [21].

AR: real-time direct or indirect view of the physical real-world environment that has been enhanced by the addition of virtual computer-generated information [29].

Computer Vision: field of AI that aims to build autonomous systems that could perform some tasks which the human visual system can perform, like image/video analysis and interpretation [30].

Cryptography: field of computer science that aims to transform messages in order to make the message unintelligible to all but the intended receiver of the message [31].

Data (De)Compression: field of computer science that studies algorithms that aim to decrease the number of bytes used to represent data, while preserving (lossless) or reducing (lossy) their quality.

Database Manipulation: manipulation of information stored in a database system, intended as an organized collection of structured information, typically stored electronically in a computer system [32].

Genome Sequence Alignment: alignment of sequences of nucleotide bases, read from a DNA molecule with a "genome sequencing" operation [33].

Graph Processing: manipulation of data organized in a graph structure.

Graphics: field of computer science that aims to communicate visually through a computer display and its interaction devices [34].

Multimedia: processing of streams of data of multiple content forms, such as audio, video, speech, text, etc.

NFA: computation of finite state automata in which more than one transition is possible from one state [35].

Network Flow Encryption: set of operations used to encrypt data flowing in input or output through the network card [36].

Pattern Matching: set of operations used to test an expression in order to check the presence/absence of a given pattern.

Search Acceleration: set of operations used to deliver relevant web documents to users with consistently low response times [37].

5

The presence of a "General-purpose" domain seems in contrast with the special-purpose nature of accelerators. However, it takes into account those devices that are used as accelerators, i.e., connected to a host system where a general-purpose processor executes an O.S. and regulates execution on the device, but have no constraints on the kind of applications supported. For instance, it is the case of FPGAs, as they have the ability to model even a general-purpose CPU [15,16].

Other domains exist besides those listed above. Some examples can be found in [38–45]. However, in this work we aim to pragmatically focus and cover the vast majority of domains, so as to allow discussing a solid classification of existing accelerators and their features. The remaining accelerator domains could reasonably fit in some listed domains in any case or very close to them.

2.1.5. Programmability

Every accelerator gives the possibility to perform some computations on user-provided input data to produce output data and return them back to the user. To what extent these computations can be customized characterizes the accelerator's programmability: if users are limited to triggering a predefined operation on an input stream, we say that the accelerator is not programmable. Conversely, we say that an accelerator is programmable if it has the ability to interpret instructions, usually provided as binary code, and perform different operations based on them.

2.1.6. Reconfigurability

An accelerator is classified as reconfigurable if there exists a software-based procedure to alter the logic functions performed by its hardware, e.g., FPGAs and CGRAs, where software techniques can be used to change the operations performed by the logic elements and the interconnection between them. Reconfigurable accelerators are always equipped with various blocks of programmable logic, (re)configured by users after manufacturing. Reconfiguration is generally achieved by writing a bit-stream of configuration data into a configuration memory, which is then used to feed several on-chip memory cells that are responsible for regulating the behavior of single components.

2.2. Host coupling

In this category, shown in Fig. 4, we put all those aspects that describe how the accelerator is connected to the rest of the system, the so-called host.

Fig. 4. The Host Coupling macro-category.

2.2.1. Connection strategy

It is the way the accelerator is physically connected to the host. We identify the following possibilities, spanning from the tighter and close-range couplings up to the more loose and distant ones:

bump-in-the-wire: special placement between two existing components, achieved by connecting input/output wires of the components to those of the accelerator.

CCIX: cache-coherent interconnect expressed as a set of specifications defined by the CCIX Consortium. It provides a non-proprietary standard to ensure chip-to-chip communication between different devices in heterogeneous systems, with high bandwidth and low latency [46,47]. It enables the expansion of the OS paged memory with the memory of the accelerator, creating a cache-coherent virtual memory space.

DRAM slot: standard motherboard slot where RAM modules are inserted. Accelerators using this connection are designed as memory DIMMs with integrated in-memory computing units.

HMC: high-performance single package containing four or eight DRAM dies stacked together with a computing logic layer [48,49]. It can be connected to the host in various ways, e.g., DIMM slots or Intel QuickPath Interconnect [50].

integrated on-chip: the accelerator is part of a bigger SoC that includes at least a general-purpose CPU.

M.2: flexible specification for external devices that supports PCIe, SATA, and USB buses.

OCP OAM: flexible high-speed interconnect based on a mezzanine form factor, with a focus on interoperability between different components [51].

PCI-e: high-speed expansion bus, the most common motherboard interface for external devices.

network interconnect: standard physical connection for network devices. Accelerators using this connection are complex devices, generally designed as PoDs.

USB: most common industry standard to connect external peripherals.

no details: no further details about the accelerator connection are disclosed.

2.2.2. Software interface

It describes the set of operations that can be performed in code in order to interact with the accelerator. The perspective is that of the application developer: in order to write accelerated code, they need to perform some operations to set up and start execution on the accelerator. We identify the following possible operations (a minimal host-side sketch follows this list):

Memory management: accelerator memory allocation/freeing and data movement from/to the host memory to/from the accelerator memory.

Soft configuration: software configuration of the accelerated task. It may involve a parameter writing that tunes a pre-defined operation or the loading into the accelerator of a binary representing the operation.

Programmable Logic configuration: software-based procedure to configure the programmable logic of a reconfigurable device, generally expressed in RTL/HDL.

Kernel launch: invocation and execution of a user-defined kernel function on the accelerator.

Operation execution: single operation execution, generally achieved through a command packet sending.
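To show how these operations typically compose on the host side, the following C++ sketch drives a hypothetical attached device through an invented acc_* interface (every name is a placeholder, not the API of any real vendor runtime); the "device" is emulated with plain host memory so the example runs as-is, and the steps map onto the memory management, soft configuration, and kernel launch operations defined above.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical host-side runtime: every acc_* symbol is invented for
// illustration (it is NOT the API of any real accelerator SDK). The "device"
// is emulated with host memory so that the sequence can actually execute.
struct acc_buffer { std::vector<float> mem; };

acc_buffer* acc_alloc(std::size_t n)                { return new acc_buffer{std::vector<float>(n)}; }
void acc_free(acc_buffer* b)                        { delete b; }
void acc_copy_to_device(acc_buffer* b, const float* src, std::size_t n) { std::memcpy(b->mem.data(), src, n * sizeof(float)); }
void acc_copy_to_host(float* dst, const acc_buffer* b, std::size_t n)   { std::memcpy(dst, b->mem.data(), n * sizeof(float)); }
void acc_set_param(float& slot, float value)        { slot = value; }   // soft configuration
void acc_launch_scale(acc_buffer* in, acc_buffer* out, float factor) {  // kernel launch
    for (std::size_t i = 0; i < in->mem.size(); ++i) out->mem[i] = factor * in->mem[i];
}

int main() {
    std::vector<float> host_in = {1.f, 2.f, 3.f, 4.f}, host_out(4);
    float factor = 0.f;

    acc_buffer* d_in  = acc_alloc(host_in.size());      // memory management
    acc_buffer* d_out = acc_alloc(host_in.size());
    acc_copy_to_device(d_in, host_in.data(), host_in.size());

    acc_set_param(factor, 2.5f);                         // soft configuration
    acc_launch_scale(d_in, d_out, factor);               // kernel launch

    acc_copy_to_host(host_out.data(), d_out, host_out.size());
    for (float v : host_out) std::printf("%.1f ", v);    // 2.5 5.0 7.5 10.0
    std::printf("\n");
    acc_free(d_in); acc_free(d_out);
    return 0;
}
```

In a real system the launch would typically be asynchronous and followed by an explicit synchronization, but the overall allocate/copy/configure/launch/copy-back pattern is the one most software interfaces expose.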


Fig. 5. The Architecture macro-category.

Fig. 6. The Special-purpose Resources category, from the Architecture macro-category.

2.2.3. Memory sharing

It specifies whether the accelerator shares any level of the memory hierarchy with the host, and can access that memory level with no intervention from the CPU. It refers to contexts in which the accelerator acts as a node accessing the memory on par with the CPU, and it may even be involved in coherence protocols. We identify the following possibilities:

none: accelerator and host do not share any level of the memory hierarchy;

LLC: the accelerator has direct access to the host's LLC;

DRAM: the accelerator has direct access to the host's DRAM;

no details: no further details about the memory sharing are disclosed.

2.2.4. OS coupling

Some accelerators need to be recognized by the Operating System to properly work, while others can simply be connected to the system and work transparently. In the first case, OS responsibilities may vary, as it may be responsible for providing to the accelerator handles to particular resources, or signaling the accelerator in response to specific events, or even managing its internal state between context switches. Overall, we identify the following possible OS coupling strategies:

none: the accelerator can interact with the system transparently, no OS awareness is needed;

drivers: user-space and/or kernel-space drivers need to be installed;

custom OS: the accelerator needs a custom OS installed on the host in order to work properly;

no details: no further details about the OS coupling are disclosed.

2.3. Architecture

As specified previously, the accelerators' architecture can offer a remarkable diversity. To better highlight analogies and differences with general-purpose systems, we divide the hardware components in two categories, one for the analogies and one for the differences. Fig. 5 shows this macro-category, and Fig. 6 highlights the Special-purpose Resources.

2.3.1. General-purpose resources

They are hardware components that can be normally found in a general-purpose system. We group them into Memories and Computing engines. We identify the following:

Memory:

• L1, L2, L3: cache levels;
• DDR3, DDR4: DRAM memories;
• LPDDR4, LPDDR4X: low-power DRAM memories.

Computing engine:

• ARM CPU: ARM-based general-purpose processor;
• x86 CPU: x86-based general-purpose processor;
• Scalar unit: simple general-purpose processor, usually lightweight (e.g., in-order, single-issue) with unspecified ISA.

Other memory resources exist, but they have been replaced by more recent technologies (e.g., DDR2, LPDDR2, LPDDR3) or are not used yet (e.g., DDR5). In both cases, they are not present in the accelerators we classified. As for computing engines, we distinguish between ARM, x86 and other/unspecified ISAs because the first two are notable examples that characterize today's general-purpose systems. We put in the general Scalar unit category other possibilities, like RISC-V, MIPS, etc.

2.3.2. Special-purpose resources

The accelerators' peculiar hardware components are put under this category. It includes those resources that have been designed to explicitly fit the needs of the accelerated applications, following principles that are not those adopted for general-purpose components. We group them further into Memories, Computing engines, Fixed-function components, and Memory & Computing engines, the latter indicating those components that provide both memory and computing functions. We identify the following:

Memory:

• GDDR6, GDDR6X: graphics-oriented DRAM technologies;
• HBM, HBM2, MCDRAM, HMC-RAM: high-speed, 3D-stacked DRAM memory technologies with through-silicon vias and micro-bumps [48,49,52,53];
• eDRAM: embedded DRAM, integrated on the same die of the accelerator;
• Local memory: on-chip SRAM memory; both addressable, with content explicitly managed software-side (scratchpad), and non-addressable, transparently used to store partial results, configuration parameters, or instructions.

Computing engine:

• Spatial array: 1D or 2D array of interconnected PEs/arithmetic units that process data and exchange them independently of each other;


• Vector processor: processor that executes single instructions on arrays of data, which includes both state-of-the-art vector processors and SIMD processors [7];
• Systolic array: 1D or 2D array of PEs/arithmetic units that process data in lockstep and pass them downstream;
• Tensor core: single core with a 2D or 3D array of arithmetic units, generally specialized in MAC operations.

Fixed-function:

• Rendering pipeline engine: pipeline including all the required operations to render 3D models to a 2D screen;
• Physics engine: computational unit specialized in physical behavior emulation;
• ML auxiliary-functions engine: computational unit specialized in auxiliary ML operations (e.g., activation functions);
• Cryptography engine: computational unit specialized in cryptographic operations (e.g., hashing, cipher/decipher);
• AD/DA converter: analog-to-digital/digital-to-analog converter;
• Compr./Decompr. engine: computational unit specialized in data compression and decompression;
• Alignment engine: computational unit specialized in genomics-related alignment operations;
• DB Operations engine: computational unit specialized in common database manipulation operations.

Programmable Logic:

• Fine-grain: permits general-purpose flexibility, with highly customizable hardware and interconnections between components. It is typical of FPGAs.
• Coarse-grain: the overall structure is fixed and dictated by domain-specific flexibility. It is typical of CGRAs, which present a 2D array of reconfigurable PEs with a reconfigurable interconnect, usually less complex than FPGAs' interconnect [18].

Memory & Computing:

• ReRAM: non-volatile memory that exploits the change of its resistance to store data [54];
• STT-MRAM: non-volatile memory that uses magnetic tunnel junctions whose magnetic layers can change value by spin-polarized currents [55];
• In-DRAM logic: computing units placed within DRAM memory;
• In-SRAM logic: computing units placed within SRAM memory;
• Near-memory logic: computing units placed close to, but not within, memory.

Also in this case, we omit some memory resources that have been replaced by more recent technologies, like GDDR5 and previous iterations, because they are not used in the categorized accelerators. We merge software-controlled local memories, often called scratchpads, and buffer-like memories into the single category "local memory": although it is interesting to distinguish between the two, we usually do not have enough information to tell how a local memory is managed. As for computing engines and fixed-function components, we omit some possibilities like specialized controllers and other IP blocks that perform very specific functions.

2.4. Software aspects

The Software Aspects macro-category, shown in Fig. 7, summarizes the software characteristics of the accelerators.

Fig. 7. The Software Aspects macro-category.

2.4.1. Programming layer

It specifies the tools programmers can use to program, configure, or communicate with the accelerator. These are representative of the abstraction level which programmers can leverage and, consequently, how productively they can write accelerated code. We identify the following possibilities, sorted by increasing level of abstraction:

HDL: used to describe digital circuits, it can be employed to (re)configure the programmable logic of reconfigurable accelerators, as previously defined;

assembly: the accelerator's equivalent to a general-purpose processor's assembly language;

libraries: collections of functions and classes that invoke accelerated code under the hood; they can be used in high-level source code to gain access to the accelerator;

high-level languages: the accelerator's equivalent to a general-purpose processor's programming languages; they permit generating accelerator code by compiling/interpreting high-level source code.

2.4.2. Ecosystem

This category is closely related to the previous one. It lists the particular frameworks, compilers, libraries, etc. available to programmers and end-users that intend to leverage the accelerator. There are both open-source (e.g., [56–64]) and vendor-provided examples (e.g., [65–73]).

2.4.3. Granularity

It describes the kind of task that is demanded to the accelerator. It is tightly related to the level of flexibility supported. We identify three of them, sorted by increasing complexity (a short sketch contrasting the first two follows this list):

Wired task: single operation wired into the accelerator hardware. Generally, it can be "activated" with a command packet.

Function-level: user-defined function, as intended in the context of C, C++ and similar high-level programming languages. It is not necessarily invoked as-is, but can be part of a more complex execution flow: e.g., it may be the activation function of a neural network's neurons, a function performed by each node of a graph on its input data, etc.

Application-level: a full application can be executed on the accelerator.
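To make the first two granularity levels tangible, the sketch below contrasts them in C++; every acc_* name is an invented placeholder (not any real SDK), and the two calls are stubbed so the snippet compiles and runs as a host-side illustration only.

```cpp
#include <cstdint>
#include <cstdio>

// (a) Wired task: the accelerator exposes one fixed operation; the host only
//     fills a command packet and fires it (no user code runs on the device).
struct command_packet {
    std::uint32_t opcode;     // selects the hard-wired operation
    std::uint64_t src_addr;   // device address of the input buffer
    std::uint32_t length;     // bytes to process
};
void acc_submit(const command_packet& pkt) {   // stub for a doorbell/MMIO write
    std::printf("wired task: opcode=0x%02x over %u bytes\n",
                (unsigned)pkt.opcode, (unsigned)pkt.length);
}

// (b) Function-level: the host supplies a user-defined function that the
//     runtime applies inside a larger accelerated flow, e.g., as the
//     activation function of every neuron in a layer.
using activation_fn = float (*)(float);
float relu(float x) { return x > 0.f ? x : 0.f; }
void acc_run_layer(const float* in, float* out, int n, activation_fn act) {
    for (int i = 0; i < n; ++i) out[i] = act(in[i]);   // stub emulation on the host
}

int main() {
    acc_submit({0x01, 0x1000, 4096});                   // wired task

    float in[4] = {-1.f, 2.f, -3.f, 4.f}, out[4];
    acc_run_layer(in, out, 4, relu);                     // function-level task
    for (float v : out) std::printf("%.1f ", v);         // 0.0 2.0 0.0 4.0
    std::printf("\n");
    return 0;
}
```

An application-level granularity would instead hand the whole program (not shown here) to the device, as programmable manycore accelerators allow.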
3. Accelerator classification

In this section, we classify several accelerators according to the criteria presented in the previous section.


Table 2
List of references of accelerator families used to fill the classification tables. For each, the models considered for performance figures are shown.
Accelerator family Models References
AMD Radeon RX 5000 Series Radeon RX 5300, 5600 XT, 5700 XT [74–77]
AMD Radeon RX 6000 Series Radeon RX 6800 XT, 6900 XT [78–80]
ARM Mali Bifrost [81–85]
ARM Mali Valhall [81,86,87]
Intel FPGA 10 Series Intel Stratix 10NXa [14,88–97]
Intel FPGA F-Series [98–100]
Intel FPGA V Series [101–104]
Intel Graphics Technology (Gen11) UHD Graphics G1, Iris Plus Graphics G4, G7 [105–108]
Intel Graphics Technology (Xe-LP) UHD Graphics 730, 770, Iris Xe Graphics G7 [109–112]
Intel Nervana NNP-I 1100, 1300 NNP-I 1100, 1300 [113–117]
Intel Nervana NNP-T 1300, 1400 NNP-T 1300, 1400 [115,118–121]
Intel Xeon Phi Knights Landing Xeon Phi KL7210, KL7250, KL7290 [122–126]
Intel Xeon Phi Knights Mill Xeon Phi KM7235, KM7285, KM7295 [125,127–129]
NEC SX-Aurora TSUBASA Vector Engine Type10 Type 10A, 10B, 10C [130–133]
NEC SX-Aurora TSUBASA Vector Engine Type20 Type 20A, 20B [130,131,133,134]
NVIDIA GeForce RTX 20xx GeForce RTX 2060, 2060T,b 2080, 2080T,b Titan, TitanTb [135–137]
NVIDIA GeForce RTX 30xx GeForce RTX 3060, 3060T,b 3070Ti, 3070TiT,b 3090, 3090Tb [138–141]
Xilinx FPGA 7 Series [142–146]
Xilinx FPGA Ultrascale+ Series [147–151]
Xilinx Versal ACAP [152–154]
a: AI-optimized blocks. b: Tensor cores.

3.1. Methodology

Many accelerators have been proposed during the years, probably starting with the Intel 8087 Math Coprocessor [289], announced in 1980 for PCs. A treatise comprehending all the accelerators proposed in both industry and academia during more than 40 years is outside the scope of our work. We are interested in depicting the current situation of the accelerator world, in order to provide a useful guide to professionals with different backgrounds, as stated in the Introduction. Also, we are interested in identifying trends and determining successful features and open challenges to complete the "big picture". In order to do so, we also need to take into consideration recent accelerators. We analyze accelerators proposed in the last 10 years (i.e., not before 2013), knowing that some fundamental ones that had a big impact in the computing world are not included, like the aforementioned 8087 (1980), the Cell B.E. SPE [290] (2005), or the GeForce 256 (1999), that was dubbed "the world's first GPU" [291].

Due to the level of detail we adopt in this paper, we must necessarily analyze accelerator models or, when enough traits are in common, families. For instance, we cannot categorize "GPUs", as the term is too broad for our purposes. On the opposite side, a categorization that distinguishes between single models is often not necessary, as we can safely describe the family they belong to, highlighting differences and factorizing analogies. Thus, we categorize e.g., GeForce RTX 20xx [135], AMD Radeon 5000 Series [74], ARM Mali Bifrost [85], and so on, which are families of accelerators, including models with many traits in common. A distinction between single models is done mainly in performance-oriented discussions. Moreover, for accelerators belonging to the same "evolution line", e.g., GeForce GTX 10xx, GeForce RTX 20xx, and GeForce RTX 30xx, we pick the last two generations, also when more than two have been proposed in the last decade. Lastly, since some accelerator families feature tens of models with few differences, we take into consideration at most three different models, chosen in order to cover a wide spectrum – e.g., entry-level, middle-tier, and high-end/enthusiast.

FPGAs are treated as a special case. They are highly-flexible reconfigurable accelerators, able to accelerate tasks from different domains and with different characteristics, as a significant body of work testifies – e.g., [292–299]. In many research works, they are used to implement accelerator proposals as prototypes, or soft accelerators. In this survey, we discuss them as accelerators per se, rather than considering the single accelerator instances that have been obtained as particular configurations. For instance, in [293], Zhu et al. describe an accelerator for sparse CNNs implemented on a Xilinx ZCU102 board, which includes an Ultrascale FPGA [147]. For our purposes, we describe the Ultrascale FPGA reconfigurable accelerator, rather than the sparse CNN accelerator prototyped on it.

The vastness of the topic here discussed and the level of detail we want to consider for each accelerator necessarily set some limitations on this work. First, despite our research effort, there could be some accelerators missing from our study. In order to mitigate the impact of these omissions, we categorize several accelerators from many domains, so as to have a good coverage of the whole spectrum and increase the probability that no fundamental aspects are overlooked. Second, during the review phase of our work and after its publication, other accelerators will likely be proposed, both as commercial models or as research proposals. We mitigate the impact of their absence from our analysis by identifying trends that will likely continue in the near future, and shape the accelerators to come. Third, for some accelerators, not all the aspects that we deem relevant are defined/disclosed, and thus we cannot categorize them fully according to our taxonomy. To handle these cases, we provide a "no details" value for the missing aspects.

In the following, Tables 5, 6, 7 and 8 present the accelerator classification in accordance to the discussed taxonomy. Each table covers one macro-category. Accelerator rows with the same values are grouped to highlight similarities.

Tables 2 and 3 list the references related to the classified accelerators and the considered models. By concentrating all of them in two specific tables, we avoid cluttering the classification tables with references that would hamper readability.

3.2. Classification

In the following, we classify more than 100 accelerators and families of accelerators. Table 5 presents their general aspects, Table 6 presents their host coupling details, Table 7 presents their architecture, and Table 8 presents their software aspects. We comment on the trends that emerge from the tables.


Table 3
List of references of accelerators used to fill the classification tables.
Accelerator References
PuDianNao [181,255]
AHA371/372 [155,156] PUMA [256]
AHA374/378 [157–159] PX-CGRA [257]
AHA604/605 [160] Q100 [258,259]
Apple A13 Bionic Neural Engine [161–163] Qualcomm Hexagon 698 DSP [260–262]
Apple A14 Bionic Neural Engine [164–166] Qualcomm Hexagon 780 DSP [262–264]
Baidu Kunlun K200 [167,168] RADAR [265]
BioSEAL [169] RAPID [266]
Cambricon-X [170] REMUS [267]
CASCADE [171] RobustInMem [268]
Cerebras WSE [172–174] Samsung Exynos 9825 NPU [269,270]
Coral USB Accelerator [175–178] Samsung Exynos 990 NPU [271]
CSRAM [179] Sandwich-RAM [272]
DaDianNao chipa [180,181] ShiDianNao [181,273]
Darwin [33] Softbrain [10]
Darwin-WGA [182] SparseReRAM [274]
DianNao [181,183] Spin-transfer [275]
DIMA Inference Processor [184] TensorNode [276]
DIMA-CNN [185] Tesla FSD Computer NPU [277,278]
DRISA [186] Tesseract [279]
DUAL [187] TETRIS [280]
Eyeriss [20,188] Time [281]
Eyeriss v2 [189] Untether TsunAImi [282–284]
FlexFlow [190] UPMEM PIM [285,286]
FloatPIM [191] X-CGRA [287]
FPSA [192] YodaNN [288]
GenCache [193]
Google Pixel Visual Core [7,194,195]
Google TPU v2 [7,196–199]
Google TPU v3 [7,196–199]
Graphcore IPU-M2000 [200–202]
GraphH [203]
Graphicionado [204]
GraphP [50]
GraphQ [205]
GraphR [206]
GRIM-Filter [207]
GroqCard [208–210]
Hailo-8 [211]
Heterogeneous-PIM [212]
HReA [213]
HRL [214]
Huawei Atlas 200 AI [215–218]
Huawei Atlas 300I [216–219]
Huawei Atlas 300T [216–218,220]
Huawei Kirin 9000x NPU [218,221–223]
Huawei Kirin 990x NPU [218,223–225]
IMP [226]
Intel Habana Labs HL-100 [227]
Intel Habana Labs HL-20x [228]
ISAAC [229]
LerGAN [230]
Micron Automata Processor [231–235]
Microsoft Project Catapult [36,236,237]
Mixed-signal [238]
Multibit [239]
NAND-Net [240]
NDA [241]
NEST [242]
Neural Cache [243]
Neurocube [244]
NonVolCIM [245]
NP-CGRA [246]
Origami [247,248]
PipeLayer [249]
Plasticine [250]
PLiM [251]
PracticalNDP [252]
PRIME [253]
PROMISE [254]
a: Single chip.

3.2.1. General aspects

Table 5 shows an overall picture of the accelerator landscape of the last decade.

Type. Accelerator designers have fully embraced the PIM approach, deciding to tackle the memory wall problem [300] by moving computation closer to the memory, sometimes even augmenting the memory itself with processing capabilities, as we show in Table 7. These are usually experimental proposals coming from academia, with only Micron Automata Processor [231,232] and UPMEM PIM [285] coming from industry. The table also highlights their versatility, as they have been employed for ML (both inference and training), Cryptography, Genomics, Graph processing, Data-parallel, NFA, and Pattern matching. We will discuss the PIM type further in Section 5.1.

After PIM, the most represented type is the Manycore accelerator. Unlike PIM, the manycore approach is well-established and has been successfully applied over the years: it is perceived as the natural evolution of the multi-core approach, with an even more extreme commitment to parallelism. As such, it can count many industrial accelerators, usually dedicated to data-parallel and ML workloads, like Huawei's products based on the DaVinci architecture [218] or Intel's Nervana [113,118] and Habana [227,228] products.

Its most prominent declination, i.e., the GPU, is arguably the most common type of accelerator: GPUs are no longer limited to graphics, as they are now fully-fledged data-parallel accelerators. Despite the few entries in the table, which reflect the market dominance of only three producers (AMD, ARM, and NVIDIA), GPUs have a prominent role in the mobile, desktop, data-center, and HPC market segments.

Around one third of the categorized accelerators are designed as arrays of interconnected ALUs/PEs. Systolic accelerators are the most abundant, and they are employed to target ML workloads, both inference and training. Designers can rely on them to accelerate NNs, which constitute the majority of ML workloads accelerated today (as shown in Table 4), because there is a topological similarity that can be exploited by mapping NN layers on levels of the systolic array, neurons on PEs, and connections between layers on connections between levels. Moreover, they are particularly suitable to map matrix–matrix multiplications [196], and thus convolutions, which are the heart of CNNs, the most common and most performance-demanding NN type.

CGRAs are gaining increasing attention from the community. Their interesting characteristics (i.e., high efficiency, fast reconfigurability, support for both spatial and temporal computations) make them a good fit for various workloads, especially in the dataflow area. HReA, a CGRA accelerator for the 13-Dwarfs [213], proves that CGRAs, despite their domain-specific flexibility, can still be flexible enough to accelerate a high variety of real-world tasks. However, they still present some limitations and challenges that relegate them to academic proposals, as the absence of industrial CGRAs in Table 5 highlights. We will deepen the discussion on these promising architectural solutions in Section 5.2.
10

Table 4
Supported workloads in detail for accelerators in the ML domain.
Accelerator ML workload
Apple A13, A14 Bionic Neural Engine DNN
Baidu Kunlun K200 DNN
Cambricon-X DNN (Sparse NNs, CNN)
CASCADE CNN, RNN
Cerebras WSE CNN
Coral USB Accelerator CNN, RNN, MLP
CSRAM CNN
DaDianNao chip CNN
DianNao CNN
DIMA Inference Processor SVM, Template matching, k-NN, Matched filter
DIMA-CNN CNN
DRISA CNN, RNN
DUAL Unsupervised ML
Eyeriss v1, v2 CNN
FlexFlow CNN
FloatPIM CNN
FPSA ANN
Google Pixel Visual Core CNN
Google TPU v2, v3 CNN, RNN, MLP
Graphcore IPU-M2000 CNN
GroqCard CNN
Hailo-8 DNN
Heterogeneous-PIM CNN, GAN
HRL DNN
Huawei Atlas 200 AI DNN
Huawei Atlas 300I, 300I Pro DNN
Huawei Atlas 300T DNN
Huawei Kirin 900x, 9000x NPU DNN
Intel Habana Labs HL-100 DNN
Intel Habana Labs HL-20x DNN
Intel Nervana NNP-I 1100, 1300 DNN
Intel Nervana NNP-T 1300, 1400 DNN
ISAAC CNN
LerGAN GAN
Microsoft Project Catapult DNN
Mixed-signal BNN
Multibit CNN
NAND-Net CNN, BNN
Neural Cache CNN
Neurocube CNN
NonVolCIM CNN
NP-CGRA CNN
NVIDIA GeForce RTX 20xx, 30xx DNN (CNN, MLP, RNN), ML (LinReg, Lasso, ElasticNet, k-NN)
Origami CNN
PipeLayer CNN
Plasticine k-means
PracticalNDP CNN, MLP
PRIME CNN, MLP
PROMISE SVM, k-NN, DNN, LinReg, Template matching
PuDianNao k-NN, k-means, DNN, SVM, LinReg, Naive Bayes, Classification tree
PUMA MLP, LSTM, CNN, Lin/Log-Reg, SVM
Qualcomm Hexagon 698, 780 DSP DNN
RobustInMem SVM
Samsung Exynos 9825, 990 NPU CNN
Sandwich-RAM CNN
ShiDianNao CNN
SparseReRAM CNN
Spin-transfer k-means, k-NN, MLP, SVM
Tesla Full Self Driving Computer NPU DNN
TensorNode CNN, RNN, MLP
TETRIS CNN
Time CNN, MLP
Untether TsunAImi DNN (Vision, MLP, Recommendation)
UPMEM PIM DLRM, CNN, k-means, Decision tree
Xilinx Versal ACAP CNN
YodaNN BNN

Spatial accelerators are used for inference tasks in all the cases, and training in some cases. Some models are also suited for generic dataflow workloads, as spatial accelerators, like CGRAs, offer a natural way to map DFGs: PEs can implement the processing associated to nodes, and the path on the interconnection can serve as the arc between them. In a couple of accelerators, i.e., Neurocube [244] and TETRIS [280], a spatial array of PEs has been implemented in the logic die of an HMC stack [48].

All FPGA accelerators in the table come from industry. The FPGA market is similar to the GPU market: few manufacturers, Intel and Xilinx in this case, are responsible for several products that span a high variety of form factors, from chips integrated into development boards to PCI-e cards. They have the ability to accelerate any task, thanks to a fine-grain reconfigurability that can mimic even a general-purpose processor [15,16].

Table 5
Accelerators’ general aspects.
Columns of Table 5: Accelerator; Year; Type; Origin; Domain, split into Broad (Data-parallel, Dataflow, General-purpose) and Narrow (ML Inference, ML Training, AR, Computer Vision, Cryptography, Data (De)Compression, Database Manipulation, Genome Seq. Alignment, Graph Processing, Graphics, Multimedia, Network Flow Encryption, NFA, Pattern Matching, Search Acceleration), marked with ✕ where applicable; Programmability (Y/N); Reconfigurability (Y/N).

AHA 371, 372 2014


ASIC Ind ✕ N N
374, 378 2015
AHA 604, 605 2016 ASIC Ind ✕ ✕ N N
AMD Radeon RX 5000 Series 2019
GPU Ind ✕ ✕ ✕ Y N
6000 Series 2020
ARM Mali Bifrost 2016
GPU Ind ✕ ✕ ✕ Y N
Valhall 2019
Apple Bionic A13 2019
-a Ind ✕ Y N
Neural Engine A14 2020
Baidu Kunlun K200 2021 Manycore Ind ✕ ✕ ✕ Y N
BioSEAL 2020 PIM Acad ✕ Y N
Manycore
Cambricon-X 2016 Acad ✕ Y N


Systolicb
PIM
CASCADE 2019 Acad ✕ Y N
Systolicc
Cerebras WSE 2019 Spatial Ind ✕ ✕ ✕ Y N
Coral USB Accelerator 2019 Systolic Ind ✕ Y N
CSRAM 2018 PIM Acad ✕ N N
Manycore
DaDianNao chip 2014 Acad ✕ ✕ Y N
Systolicb
Darwin 2018 ASIC Acad ✕ N N
Darwin-WGA 2019 ASIC Acad ✕ N N
DianNao 2014 Systolic Acad ✕ Y N
DIMA Inference Processor 2018 PIM Acad ✕ Y N
DIMA-CNN 2018 PIM Acad ✕ N N
DRISA 2017 PIM Acad ✕ ✕ N Y



DUAL 2020 PIM Acad ✕ Y N
Eyeriss v1 2016
Spatial Acad ✕ ✕ Y N
v2 2019
FlexFlow 2017 Spatial Acad ✕ ✕ ✕ Y N
FloatPIM 2019 PIM Acad ✕ ✕ Y N
FPSA 2019 PIM Acad ✕ N Y
GenCache 2019 PIM Acad ✕ N N
Google Pixel Visual Core 2017 Manycore Ind ✕ ✕ ✕ Y N
Google TPU v2 2017
Systolic Ind ✕ ✕ Y N
v3 2018
Graphcore IPU-M2000 2020 Manycore Ind ✕ ✕ ✕ Y N
GraphH 2019 PIM Acad ✕ Y N
Graphicionado 2016 Manycore Acad ✕ Y N
GraphP 2018 PIM Acad ✕ Y N
GraphQ 2019 PIM Acad ✕ Y N
GraphR 2018 PIM Acad ✕ Y N
GRIM-Filter 2018 PIM Acad ✕ N N
GroqCard 2020 Spatial Ind ✕ ✕ Y N
Hailo-8 2019 Spatial Ind ✕ Y N
Heterogeneous-PIM 2018 PIM Acad ✕ ✕ Y N
HReA 2018 CGRA Acad ✕ N Y
PIM
HRL 2016 Acad ✕ ✕ ✕ N Y
CGRAd
Huawei Atlas 200 AI 2019 Manycore Ind ✕ ✕ Y N
Huawei Atlas 300I, 300I Pro 2019 Manycore Ind ✕ ✕ Y N
Huawei Atlas 300T 2019 Manycore Ind ✕ ✕ Y N
Huawei Kirin 990x NPU 2019
Manycore Ind ✕ Y N

9000x NPU 2020


IMP 2018 PIM Acad ✕ Y N
Intel FPGA 10 Series 2013 FPGA Ind ✕ N Y
Intel FPGA F-Series 2019 FPGA Ind ✕ N Y
Intel FPGA V Series 2013 FPGA Ind ✕ N Y
Intel Graphics Gen11 2019
GPU Ind ✕ ✕ Y N
Technology Xe-LP 2020
e
Intel Habana Labs HL-100 2019 Manycore Ind ✕ ✕ Y N
Intel Habana Labs HL-20x 2019 Manycore Ind ✕e ✕ Y N
Intel Nervana
2019 Manycore Ind ✕ Y N
NNP-I 1100, 1300
Intel Nervana
2019 Manycore Ind ✕ Y N
NNP-T 1300, 1400
Intel Xeon Landing 2016



Manycore Ind ✕ Y N
Phi Knights Mill 2017
ISAAC 2016 PIM Acad ✕ Y N
LerGAN 2018 PIM Acad ✕ ✕ Y N
Micron Automata Processor 2014 PIM Ind ✕ ✕ N Y
Microsoft Project Catapult 2015 FPGA Ind ✕ ✕ ✕ N Y
Mixed-signal 2018 PIM Acad ✕ N N
Multibit 2019 PIM Acad ✕ N N
NAND-Net 2019 PIM Acad ✕ N N
PIM
NDA 2015 Acad ✕ N Y
CGRAf
NEC SX-Aurora Type10 2018
Vector Ind ✕ Y N
TSUBASA V.E. Type20 2020
NEST 2020 PIM Acad ✕ N N
Neural Cache 2018 PIM Acad ✕ N N
PIM
Neurocube 2016 Acad ✕ ✕ Y N
Spatialg
NonVolCIM 2018 PIM Acad ✕ N N
NP-CGRA 2021 CGRA Acad ✕ N Y
NVIDIA GeForce RTX 20xx 2018
GPU Ind ✕ ✕ ✕ ✕ ✕ Y N
RTX 30xx 2020
Origami 2015 Systolic Acad ✕ N N
PipeLayer 2017 PIM Acad ✕ ✕ Y N
Plasticine 2017 CGRA Acad ✕ ✕ ✕ N Y
PLiM 2016 PIM Acad ✕ ✕ Y N
PracticalNDP 2015 PIM Acad ✕ ✕ ✕ Y N


PRIME 2016 PIM Acad ✕ Y N
PROMISE 2018 PIM Acad ✕ Y N
PuDianNao 2015 Systolic Acad ✕ ✕ Y N
PUMA 2019 PIM Acad ✕ Y N
PX-CGRA 2018 CGRA Acad ✕ ✕ N Y
Q100 2014 ASIC Acad ✕ N N
Qualcomm 698 DSP 2020
Vector Ind ✕ ✕ ✕ ✕ Y N
Hexagon 780 DSP 2021
RADAR 2018 PIM Acad ✕ N N
RAPID 2019 PIM Acad ✕ N N
REMUS 2013 CGRA Acad ✕ N Y
RobustInMem 2018 PIM Acad ✕ ✕ N N
Samsung Exynos 9825 NPU 2019
Systolic Ind ✕ Y N



990 NPU 2020
Sandwich-RAM 2019 PIM Acad ✕ N N
ShiDianNao 2015 Systolic Acad ✕ Y N
Softbrain 2017 CGRA Acad ✕ N Y
SparseReRAM 2019 PIM Acad ✕ Y N
Spin-transfer 2017 PIM Acad ✕ Y N
Tesla Full Self Driving
2019 Systolic Ind ✕ Y N
Computer NPU
TensorNode 2019 PIM Acad ✕ Y N
Tesseract 2015 PIM Acad ✕ Y N
PIM
TETRIS 2017 Acad ✕ Y N
Spatialg
Time 2018 PIM Acad ✕ ✕ Y N


Untether TsunAImi 2021 Spatial Ind ✕ Y N
UPMEM PIM 2020 PIM Ind ✕ ✕ ✕ ✕ Y N
X-CGRA 2020 CGRA Acad ✕ ✕ N Y
Xilinx FPGA 7 Series 2017 FPGA Ind ✕ N Y
Xilinx FPGA Ultrascale+ Series 2015 FPGA Ind ✕ N Y
Xilinx Versal ACAP 2019 FPGA Ind ✕ ✕ ✕ ✕ Y Y
YodaNN 2018 Systolic Acad ✕ N N
a No details available.
b Many PEs, each being a systolic array of arithmetic units.
c Systolic array of cascaded ReRAM MAC arrays.
d CGRA implemented in the logic die of an HMC stack.
e Fully programmable Tensor Processing Cores.
f CGRA integrated near conventional DRAM.
g Spatial array of PEs implemented in the logic die of an HMC stack.



Although some concepts related to vector computing are present in other types of accelerators (e.g., GPUs), the only examples of classic vector accelerators are Qualcomm Hexagon DSPs [262] and NEC SX-Aurora TSUBASA Vector Engines [130]. The former are integrated in mobile SoCs and operate in the inference, AR, and Computer Vision domains, while the latter are shipped as PCI-e cards and can accelerate any data-parallel task.

Accelerators labeled as ASIC are non-programmable, non-reconfigurable, and are tailored to the targeted workloads. Q100, in particular, is conceived as 11 tiles aggregated in a single ASIC, each performing a particular Database-related function [258].

Domain. As for accelerators' domains, Table 5 testifies to the "Cambrian explosion" of ML-oriented proposals. Most accelerators in the table belong to this category, with inference being around three times more represented than training. This gap may be due to the fact that training, with respect to inference, has generally higher requirements in terms of storage capacity and arithmetic precision. More storage is needed because intermediate results must be preserved, while higher precision is needed because of the requirements of the gradient-based optimization techniques used to calculate the optimal weight values [21].

ML is a multifaceted domain that comprises a vast body of algorithmic solutions. The important role that AI and ML occupy in today's society and the ever-growing interest of the research community in this field require an increased level of detail. Thus, we highlight the class of ML workloads that are commonly targeted by each accelerator in Table 4. DNNs, especially CNNs, are the most abundant. They have two main characteristics that make them particularly amenable to acceleration:

• they are employed in several AI applications like speech recognition, image recognition, self-driving cars, cancer detection, and gaming [21]; thus a DNN accelerator, despite being a specialized device, is still versatile enough to accelerate many real-world applications;
• they have a high computational complexity, which makes techniques like hardware acceleration an important strategy to improve energy efficiency and throughput and expand the opportunities for DNN deployment [21].

The second most represented domain is Data-parallel (around 16% of cases), which includes diverse real-world applications that span various fields: linear algebra (GEMM, convolution, triangular factorization), graphics and imaging (rendering, filtering), physical modeling and simulation (weather forecasting, molecular dynamics, LBM), finance (Black–Scholes, option pricing), Monte Carlo simulations, etc. These applications are characterized by reduced control logic and the employment of the same processing steps on a multitude of elements in the managed dataset. Therefore, they are a good match for dedicated hardware accelerators: control logic can be minimized in favor of computing units (e.g., ALUs, FPUs, simple PEs) that are replicated in pursuit of massive parallelism.

With the ever-increasing datasets that emerge from such a diverse range of applications, the prominence of energy efficiency and throughput grows accordingly, making hardware acceleration even more important. Table 5 shows that various accelerators target this domain: all the GPUs; some Manycore accelerators like Baidu Kunlun K200 [168], Intel Habana Labs [227,228], and Intel Xeon Phi [125,126]; some PIM proposals like DRISA [186], PLiM [251], and UPMEM PIM [285]; two PIM/CGRA accelerators; the NEC SX-Aurora TSUBASA vector accelerator [130]; and Plasticine [250], a CGRA.

Multimedia can count on various accelerator proposals. Image/video filtering is a task usually performed by GPUs, but there are also specific proposals that target this domain, for instance the Huawei Atlas accelerators [215,219,220], the Qualcomm Hexagon DSPs [262], and three CGRAs (PX-CGRA [257], X-CGRA [287], REMUS [267]).

The Genome Sequence Alignment domain is populated mainly by PIM accelerators. Its applications are characterized by large datasets that involve hundreds of millions of DNA base pairs. With the PIM design, the processing can happen where data are stored, avoiding altogether a massive data transfer that would limit bandwidth and energy efficiency.

Table 5 shows that some accelerators are capable of accelerating tasks even in the general-purpose domain. As specified before, FPGAs are able to do that thanks to their fine-grain reconfigurability in both control logic and interconnect. CGRAs, which are coarse-grain reconfigurable, usually target more specific domains, but this is not the case for HReA [213]. It has been designed with a "hybrid-grained datapath" [213] that supports data-intensive applications as well as control-intensive ones, like dynamic programming, backtracking, and FSM.

Dataflow is a computation paradigm that puts data at the center and expresses a program in terms of the operations performed on them, highlighting the concurrency between operations. It was first proposed as a form of program representation by Dennis and Misunas [27], who also proposed a "data-flow processor", arguing that the classic von Neumann design would not be a good fit. In this processor, the memory is organized in instruction cells equipped with three registers that hold an instruction and two operands. As soon as the operands are available, the content of the registers is forwarded to an operation unit that applies the operation indicated by the instruction and forwards the result to a destination cell. Today's Dataflow accelerators from Table 5 adopt solutions inspired by the data-flow processor, but more advanced: four of them are Spatial architectures (Cerebras WSE [172], Eyeriss [20], Eyeriss v2 [189], and FlexFlow [190]), and three are CGRAs (PX-CGRA [257], Softbrain [10], X-CGRA [287]). The instruction cells have been replaced by PEs that include the capabilities of the operation units as well, the interconnection network allows a direct forwarding of the result to the next PE, and the instruction-driven operation has been replaced by a reconfigurable solution in the CGRAs.

Different types of accelerator target Graph processing: Spatial, Manycore, CGRA, and PIM. The latter is the preferred approach in seven out of the ten considered cases. Typical graph processing tasks that would benefit from acceleration, like big-data analytics and PageRank, are defined by large-scale graphs, poor locality, and random access patterns [50,203], with memory being the bottleneck for both performance and efficiency. Therefore, the PIM approach becomes an amenable solution: it has the capacity to avoid data movements altogether, keeping only those related to communication between processing elements.

Programmability/Reconfigurability. The most common combination of accelerators' programmability and reconfigurability is "programmable, non-reconfigurable". Most accelerators, despite being domain-specific, still offer flexibility in the form of programmability, i.e., the ability to perform operations according to user-provided instructions. However, around one fifth of the accelerators are inflexible. Generally speaking, an inflexible design may be a good choice in terms of throughput and energy efficiency, but it may limit an accelerator's lifetime and applicability. In our opinion, all the non-programmable, non-reconfigurable accelerators analyzed so far are rather protected against this risk: those in the Genome Sequence Alignment domain accelerate a task that offers little to no degrees of freedom, while those in the Data (De)Compression domain accelerate well-established algorithms that have not changed in years (e.g., the zlib and gzip algorithms). The many tiles of the aforementioned Q100 [258] act as different functional units that offer a good coverage of the possible workloads. Finally, as Table 4 highlights, the inflexible accelerators dedicated to ML inference all target BNN/CNN acceleration (except RobustInMem [268]) and retain the possibility to employ user-defined weights, which is sufficient to support a high number of different scenarios and use cases.
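The following C++ sketch illustrates the operand-availability firing rule recalled in the Dataflow paragraph above: cells fire as soon as both operands are present and forward their results to the operand slots of destination cells. The two-cell "program" and its encoding are a toy example, not a model of any accelerator in Table 5.

#include <cstdio>
#include <optional>
#include <vector>

// An instruction cell: an opcode, two operand registers, and the
// destination cell/slot that will receive the result.
struct Cell {
    char op;                       // '+' or '*'
    std::optional<double> a, b;    // operand registers
    int dest;                      // destination cell index (-1 = output)
    int destSlot;                  // 0 -> operand a, 1 -> operand b
    bool fired;
};

int main() {
    // Computes (2 + 3) * 4 with no program counter: cell 0 feeds slot 0 of
    // cell 1, and each cell fires as soon as both of its operands arrive.
    std::vector<Cell> cells = {
        {'+', 2.0, 3.0, 1, 0, false},
        {'*', std::nullopt, 4.0, -1, 0, false},
    };

    bool progress = true;
    while (progress) {
        progress = false;
        for (auto& c : cells) {
            if (c.fired || !c.a || !c.b) continue;          // not ready yet
            double r = (c.op == '+') ? *c.a + *c.b : *c.a * *c.b;
            c.fired = true;
            progress = true;
            if (c.dest < 0) { std::printf("result = %g\n", r); continue; }
            Cell& d = cells[c.dest];
            (c.destSlot == 0 ? d.a : d.b) = r;              // token forwarding
        }
    }
    return 0;
}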

Table 6
Accelerators’ host coupling.
Columns: Accelerator; Connection (PCI-e, CCIX, M.2, USB, DRAM slot, HMC, OCP OAM, integrated on-chip, bump-in-the-wire, network interconnect, no details); SW interface (op execution, kernel launch, memory mgmt, soft configuration, PL configuration, no details); Memory sharing (LLC, DRAM, none, no details); OS coupling (drivers, custom OS, none, no details). A ✕ marks the options that apply to each accelerator.
AHA371, AHA372, AHA374, AHA378 ✕ ✕ ✕ ✕
AHA604, AHA605 ✕ ✕ ✕ ✕
AMD Radeon RX 5000 Series,
✕ ✕ ✕ ✕ ✕
6000 Series
ARM Mali Bifrost, Valhall ✕ ✕ ✕ ✕ ✕
Apple A13 Bionic Neural Engine,
✕ ✕ ✕ ✕ ✕
A14 Bionic Neural Engine
Baidu Kunlun K200 ✕ ✕ ✕ ✕ ✕ ✕ ✕
BioSEAL ✕ ✕ ✕ ✕ ✕
Cambricon-X ✕ ✕ ✕ ✕ ✕
CASCADE ✕ ✕ ✕ ✕ ✕ ✕
Cerebras WSE ✕ ✕ ✕ ✕
Coral USB Accelerator ✕ ✕ ✕ ✕ ✕
CSRAM ✕ ✕ ✕ ✕ ✕
DaDianNao chip ✕ ✕ ✕ ✕ ✕ ✕
Darwin ✕ ✕ ✕ ✕
Darwin-WGA ✕ ✕ ✕ ✕ ✕

DianNao ✕ ✕ ✕ ✕ ✕ ✕
DIMA Inference Processor ✕ ✕ ✕ ✕ ✕ ✕
DIMA-CNN ✕ ✕ ✕ ✕ ✕ ✕
DRISA ✕a ✕a ✕ ✕ ✕ ✕
DUAL ✕ ✕ ✕ ✕ ✕ ✕
Eyeriss ✕ ✕ ✕ ✕ ✕
Eyeriss v2 ✕ ✕ ✕ ✕ ✕
FlexFlow ✕ ✕ ✕ ✕
FloatPIM ✕ ✕ ✕ ✕ ✕ ✕
FPSA ✕ ✕ ✕ ✕ ✕ ✕ ✕
GenCache ✕ ✕ ✕ ✕
Google Pixel Visual Core ✕ ✕ ✕ ✕ ✕ ✕
Google TPU v2, v3 ✕ ✕ ✕ ✕ ✕
Graphcore IPU-M2000 ✕ ✕ ✕ ✕ ✕ ✕



GraphH ✕ ✕ ✕ ✕ ✕ ✕
Graphicionado ✕ ✕ ✕ ✕ ✕
GraphP ✕ ✕ ✕ ✕ ✕
GraphQ ✕ ✕ ✕ ✕ ✕
GraphR ✕ ✕ ✕ ✕ ✕
GRIM-Filter ✕ ✕ ✕ ✕
GroqCard ✕ ✕ ✕ ✕ ✕
Hailo-8 ✕ ✕ ✕ ✕ ✕ ✕
Heterogeneous-PIM ✕ ✕ ✕ ✕ ✕ ✕ ✕
HReA ✕ ✕ ✕ ✕ ✕
HRL ✕ ✕ ✕ ✕ ✕
Huawei Atlas 200 AI ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕
Huawei Atlas 300I, 300I Pro, 300T ✕ ✕ ✕ ✕ ✕ ✕ ✕
Huawei Kirin 990x NPU, 9000x NPU ✕ ✕ ✕ ✕ ✕ ✕ ✕
IMP ✕ ✕ ✕ ✕ ✕
Intel FPGA V Series, 10 Series, F-Series ✕ ✕ ✕ ✕ ✕ ✕ ✕
Intel Graphics Technology Gen11, Xe-LP ✕ ✕ ✕ ✕ ✕
Intel Habana Labs HL-100 ✕ ✕ ✕ ✕ ✕
Intel Habana Labs HL-20x ✕ ✕ ✕ ✕ ✕ ✕
Intel Nervana NNP-I 1100, 1300 ✕ ✕ ✕ ✕ ✕ ✕
Intel Nervana NNP-T 1300, 1400 ✕ ✕ ✕ ✕ ✕ ✕
Intel Xeon Phi Knights Landing, Knights Mill ✕ ✕ ✕ ✕ ✕
ISAAC ✕ ✕ ✕ ✕ ✕ ✕
LerGAN ✕ ✕ ✕ ✕ ✕ ✕ ✕
Micron Automata Processor ✕ ✕ ✕ ✕ ✕ ✕
Microsoft Project Catapult ✕ ✕ ✕ ✕ ✕ ✕ ✕
Mixed-signal ✕ ✕ ✕ ✕ ✕ ✕
Multibit ✕ ✕ ✕ ✕ ✕ ✕

NAND-Net ✕ ✕ ✕ ✕ ✕ ✕
NDA ✕ ✕ ✕ ✕ ✕ ✕
NEC SX-Aurora TSUBASA
✕ ✕ ✕ ✕ ✕
Vector Engine Type10, Type20
NEST ✕ ✕ ✕ ✕
Neural Cache ✕ ✕ ✕ ✕
Neurocube ✕ ✕ ✕ ✕ ✕ ✕
NonVolCIM ✕ ✕ ✕ ✕ ✕ ✕
NP-CGRA ✕ ✕ ✕ ✕ ✕
NVIDIA GeForce RTX 20xx, 30xx ✕ ✕ ✕ ✕ ✕ ✕ ✕
Origami ✕ ✕ ✕ ✕
PipeLayer ✕ ✕ ✕ ✕ ✕ ✕
Plasticine ✕ ✕ ✕ ✕ ✕
PLiM ✕ ✕ ✕ ✕ ✕ ✕



PracticalNDP ✕ ✕ ✕ ✕ ✕ ✕
PRIME ✕ ✕ ✕ ✕ ✕ ✕ ✕
PROMISE ✕ ✕ ✕ ✕ ✕ ✕b
PuDianNao ✕ ✕ ✕ ✕ ✕ ✕
PUMA ✕ ✕ ✕ ✕ ✕ ✕ ✕
PX-CGRA ✕ ✕ ✕ ✕ ✕
Q100 ✕ ✕ ✕ ✕
Qualcomm Hexagon 698 DSP, 780 DSP ✕ ✕ ✕ ✕ ✕
RADAR ✕ ✕ ✕ ✕
RAPID ✕ ✕ ✕ ✕ ✕
REMUS ✕ ✕ ✕ ✕ ✕ ✕
RobustInMem ✕ ✕ ✕ ✕ ✕ ✕
Samsung Exynos 9825 NPU ✕ ✕ ✕ ✕ ✕
Samsung Exynos 990 NPU ✕ ✕ ✕ ✕ ✕
Sandwich-RAM ✕ ✕ ✕ ✕ ✕ ✕
ShiDianNao ✕ ✕ ✕ ✕ ✕
Softbrain ✕ ✕ ✕ ✕ ✕c ✕c ✕b

SparseReRAM ✕ ✕ ✕ ✕ ✕ ✕
Spin-transfer ✕ ✕ ✕ ✕ ✕ ✕b
TensorNode ✕ ✕ ✕ ✕ ✕d
Tesla Full Self Driving Computer NPU ✕ ✕ ✕ ✕ ✕
Tesseract ✕ ✕ ✕ ✕ ✕
TETRIS ✕ ✕ ✕ ✕ ✕ ✕
Time ✕ ✕ ✕ ✕ ✕ ✕
Untether TsunAImi ✕ ✕ ✕ ✕ ✕
UPMEM PIM ✕ ✕ ✕ ✕ ✕
X-CGRA ✕ ✕ ✕ ✕ ✕
Xilinx FPGA 7 Series,
✕ ✕ ✕ ✕ ✕ ✕ ✕
Ultrascale+ Series
Xilinx Versal ACAP ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕
YodaNN ✕ ✕ ✕ ✕ ✕



a Either PCI-e or DRAM slot.
b CPU ISA extended, no OS intervention necessary.
c Either LLC or DRAM.
d NVIDIA drivers and runtime extended to perform operations on TensorNode.
The non-programmable, reconfigurable accelerators are all FPGAs or CGRAs except two PIM accelerators: FPSA [192] and Micron Automata Processor [231]. Both of them have reconfigurable routing architectures that connect ReRAM-crossbar-based PEs in one case and State Transition Elements in the other. The only programmable, reconfigurable accelerator is Xilinx Versal ACAP [152], which is an FPGA with many specialized reconfigurable blocks (DSP, AI, Crypto, etc.) and two integrated processors.

3.2.2. Host coupling

Table 6 shows how the accelerators are connected to the rest of the (host) system. Unfortunately, many accelerators, mainly from Academia, are not described at this level of detail. In scientific papers, this aspect is often overlooked, as authors usually concentrate on the architectural details and the achieved metrics in terms of power, efficiency, throughput, and area.

Connection strategy. Connection strategy describes the physical connection to the host. Among the accelerators that provide this information, about 36% use PCI-e as their connection strategy. In desktop and server computers, PCI-e is the most common way to connect external peripherals like graphic cards, network cards, etc. Its widespread use, testified by the presence of ad-hoc slots on every desktop/server motherboard, makes it a "safe choice" that ensures maximum compatibility with existing systems. Two solutions related to PCI-e appear in the table: CCIX, which augments PCI-e with a unified, coherent virtual space, and M.2, which uses PCI-e data transfer lanes on the motherboard.

Right after PCI-e, on-die integration is the most common solution (about 26%). It is particularly suitable for mobile systems, which rely on complex SoCs that already integrate a CPU and a variety of accelerators like GPUs and NPUs [161,164,221,224,260,263]. We expect this strategy to become more common in the future, as SoCs have some interesting characteristics (e.g., low latency, high integration, high efficiency) that will likely lead them to expand their presence also in desktop and server systems, as the advent of Apple's M1 may anticipate [301].

A surprising aspect that emerges from Table 6 is that the third most abundant connection strategy is HMC, albeit considering the omissions due to lack of information. It is used by PIM accelerators that implement their logic in the logic die of an HMC stack and adopt the connections provided by the HMC technology, like DIMM slot and Intel QuickPath Interconnect [50]. After HMC comes DRAM slot, which is another solution used by PIM accelerators. In this case, these are designed as DRAM cards augmented with acceleration logic.

Bump-in-the-wire is a solution that works for accelerators performing a fixed filtering task: ShiDianNao [273], Mixed-signal [238], and RobustInMem [268] are designed to be connected directly to a sensor and filter data coming from it, while Microsoft Project Catapult [36] can be placed between the NIC and the top-of-rack switch. Network interconnect is used by industrial accelerators designed to integrate easily with a data-center infrastructure (i.e., Cerebras WSE [172], Google TPUs [197], and Graphcore IPU-M2000 [201]). They are shipped in PoDs that are connected to the network and controlled by hosts as network nodes.

SW interface. SW interface describes the operations an application programmer must perform in order to interact with the accelerator. It is another category often not covered in academic proposals. Unlike the connection strategy, it is possible to make assumptions based on other characteristics of the accelerators, such as their programmability, their reconfigurability, the nature of the accelerated workloads, the architectural components, etc. Therefore, in this category, rather than adding a no details column, we make assumptions based on these aspects.

Op execution is the most common host-side action to interact with the accelerator. Its abundance depends on the fact that it is the only solution to trigger execution on non-programmable accelerators, but it can also be the preferred interface for programmable accelerators: in fact, as long as accelerator-side control flow is not involved, an approach based on library calls wrapping single operations is preferred to the development of a fully-fledged programming language and its necessary tools – a compiler, above all.

Another common action is the explicit memory management on the accelerator. This can be performed at low level, e.g., with explicit LOAD and STORE instructions as in the accelerators of the DianNao family [181], or by invoking memory management API functions like cudaMalloc, cudaFree, and cudaMemcpy calls in the CUDA C++ extension for NVIDIA GPUs [65].

Soft configuration is a widespread solution to manage accelerators in contexts where little flexibility is needed. This does not necessarily reflect the capability of the accelerator, as highly flexible, even Turing-complete accelerators can provide a soft configuration procedure as a convenient interface. For instance, many accelerators resort to it to perform ML inference tasks: a soft configuration procedure takes care of loading a model into the accelerator, and a corresponding op execution runs the model on the input.

Kernel launch causes the execution of a function on the accelerator. The kernel is always expressed in a high-level programming language and is compiled with an accelerator-specific compiler. Therefore, accelerators supporting this operation offer a high degree of flexibility. A kernel launch operation does not necessarily execute the kernel function as-is or as the executable body of a multitude of threads (à-la data-parallel model), but can be part of a more complex structure. For instance, in Graph processing accelerators like Graphicionado [204], GraphP [50], and GraphQ [205], the kernel function is executed as a node function in a user-defined graph.

PL configuration concerns the act of reconfiguring the programmable logic of a reconfigurable accelerator. FPGAs and CGRAs support this procedure, although with different languages and time scales. Generally, FPGAs achieve this by means of RTL/HDL, and CGRAs could support domain-specific lightweight possibilities that allow temporal computations.

Memory sharing. Memory sharing describes whether a level of the memory hierarchy is shared between the host and the accelerator. PIM accelerators often realize some form of memory sharing because their logic is implemented in- or near- host memory that can also be used as classic memory in non-accelerated, CPU-driven tasks. Apart from Softbrain [10], which can be implemented to share LLC or DRAM with the host, all the other PIM accelerators sharing a memory level share DRAM. In particular, this happens for the many accelerators implemented in the logic die of HMC stacks. This is not the only possible solution: many PIM accelerators with in-SRAM logic, as Table 7 shows, could be implemented in the LLC or even into inner caches. Unfortunately, the majority of papers describing them report no details on their possible use as enhancements of existing caches.

Another group of accelerators that share memory with the host on some level are those integrated on-die. From a mere form factor perspective, these accelerators have access to a limited die area, do not have room for memory apart from small local memories or caches, and thus need to be connected to the host memory. In some cases it is the LLC (e.g., Apple Bionic Neural Engines [161,164], Huawei Kirins [221,224]), in others it is the DRAM (e.g., Google Pixel Visual Core [194], Samsung Exynos [270,271]). We will examine in depth the implications of memory sharing on coherence and virtual memory in Section 5.
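To make the memory-management and kernel-launch host-side actions discussed above concrete, the following is a minimal CUDA C++ sketch for NVIDIA GPUs, built only on the runtime calls the text already names (cudaMalloc, cudaMemcpy, cudaFree) plus a trivial kernel. Error handling is omitted and the kernel itself is illustrative; the same sequence of actions is exposed, with different APIs, by the other programmable accelerators in Table 6.

#include <cuda_runtime.h>
#include <cstdio>

// Kernel executed on the accelerator: the "kernel launch" action, with one
// thread per element (data-parallel model).
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 10;
    float host[1 << 10];
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    // Explicit memory management on the accelerator: allocate device
    // memory and copy the input over (the "memory mgmt" action).
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);   // kernel launch

    // Copy the result back and release the device buffer.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    std::printf("host[7] = %.1f\n", host[7]);        // 14.0 expected
    return 0;
}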


OS coupling. Drivers are chosen in 41 out of 101 cases for OS coupling, albeit 54 of them do not provide this detail. For many accelerators, they are the de-facto standard way to interact with the OS to perform various tasks, such as managing accelerators' resources via MMIO, orchestrating concurrent accesses, taking care of virtual-to-physical memory mapping, and so on. In some cases, the accelerator can work independently of the OS, as the few cases with none OS coupling testify. Origami [247] achieves this by interposing a controlling FPGA between it and the host: an OS coupling between host and FPGA is still necessary, but the accelerator can operate transparently. The other three solutions, i.e., PROMISE [254], Softbrain [10], and Spin-transfer [275], adopt ad-hoc ISA extensions that allow the CPU to communicate directly with the accelerator. On the other side of the spectrum, the NEC SX-Aurora TSUBASA Vector Engine accelerators [130] move many responsibilities into the OS, requiring even a custom Linux OS equipped with special services and modules. This is used to realize what the authors call the "reverse offloading" mechanism: when an executable is launched, the code is transferred to the accelerator, a process is allocated there, and a corresponding pseudo-process is allocated on the host. This pseudo-process is responsible for allocating virtual memory pages, asking a service to allocate physical pages on the accelerator, executing system calls and handling exceptions originated on the accelerator, and sending back the results [132].

3.2.3. Architecture

Accelerator descriptions, both in industrial datasheets and academic papers, usually reserve a prominent role for the presentation of the architecture. Arguably, this aspect more than others highlights the peculiarity of an accelerator. As for our categorization, Table 7 shows the architectural details of the analyzed accelerators. We have been able to find relevant information for all the accelerators except Apple A13 and A14 Bionic Neural Engines [161,164]. Table 7 does not list all their architectural components, but rather the main ones. We selected them so as to avoid diluting the pertinent information and cluttering the table with hardware components that are too specific to single accelerators.

General-purpose resources. Caches are fundamental components in general-purpose processors. They help to increase data bandwidth and reduce latency dramatically, avoiding costly accesses to the main memory. However, they are not as common in accelerators: about 23% of the listed ones have an L1 cache, 15% an L2 cache, and only 4% an L3 cache. Most of the time, they are replaced by local memories.

Accelerators that feature a DRAM technology also found in general-purpose systems (DDR3, DDR4, LPDDR4, LPDDR4X) are even fewer than those with cache memories: only about 9%. This is due to many factors: integrated accelerators do not need their own DRAM, as they usually connect to a shared LLC or directly to the system DRAM; the variety of existing memory technologies allows designers to select solutions that are a better fit for the accelerator requirements (e.g., GDDRs have higher bandwidth and larger buses optimized for graphics workloads, HBMs have higher bandwidth and efficiency); some PIM accelerators are designed around a different memory technology and cannot adapt to DRAM (e.g., ReRAM, HMC). However, there are also PIM accelerators that, instead, augment the standard DRAM with near-data processing: Micron Automata Processor [231] and NDA [241] augment DDR3, NEST [242] and UPMEM PIM [285] augment DDR4.

Table 7 lists general-purpose computing engines directly involved in the accelerated computations as the main workforce. We exclude those involved in orchestration/control of other resources. We make this distinction because almost every accelerator uses one or more controllers to manage its resources (e.g., memory units, systolic arrays, etc.), but only 10% of them employ general-purpose cores as their principal computing engines. Therefore, we concentrate our analysis on the latter case.

Xilinx Versal ACAP [152] accelerators have two ARM dual-cores on-board: Cortex-A72 and Cortex-R5F. They may have a role in accelerating diverse tasks, but can also run an OS to make the Versal ACAP an autonomous board, rather than an accelerator. Two ML-oriented (both inference and training) PIM accelerators use ARM cores as computing engines: Heterogeneous-PIM [212] and PracticalNDP [252]. They integrate a simple general-purpose in-order ARM core (Cortex-A7 in PracticalNDP, Cortex-A9 in Heterogeneous-PIM) into the logic die of an HMC stack to realize near-data processing. This solution allows them to provide the highest flexibility, as the integrated processors are capable of performing any calculation on the data stored in each vault. Tesseract [279] is a Graph processing accelerator designed according to the same principle. The authors choose a generic "single-issue, in-order" core similar to a Cortex-A5 as their computing engine. GraphH [203], GraphP [50], and GraphQ [205] are Graph processing accelerators inspired by Tesseract. GraphH chooses to integrate an ARM Cortex-A5 with FPU, but without cache, while the other two are less specific, à-la Tesseract. General-purpose scalar units are also present in another PIM accelerator, UPMEM PIM [285,286], which integrates a scalar unit (DPU) into a standard DDR4 DIMM, and in the Manycore accelerator Baidu Kunlun K200, as part of its XPU-clusters [168], which feature an ALU for basic instructions and an SFU for log, exp, sqrt, etc.

Only two accelerators feature x86 processors, both Manycore proposals from Intel: Intel Xeon Phi [125,126], for Data-parallel workloads, and Intel Nervana NNP-I [113], for ML inference. In both cases, the presence of x86 cores is a plus from the programmability standpoint, as familiar software stacks for desktop development can be adapted to write accelerated code. Interestingly, both projects have been abandoned. Xeon Phi presumably because of the competition with GPUs, as both families of accelerators target the same workloads, but GPUs are generally cheaper and already widespread in the desktop and data-center market segments. The promise of a common ISA between data-parallel accelerator and general-purpose cores was not enough to steer the enormous momentum of the GPU ecosystem. Nervana was abandoned when Intel decided to acquire Habana Labs to switch to a unified architecture for inference and training [302].

Special-purpose resources. The role cache memories play in general-purpose processors is usually played by local memories, which, conversely, are present in 86% of cases. Under the local memory name, we list on-chip SRAM memories implicitly managed by the execution flow (e.g., to store intermediate results) or explicitly managed by the programmer. Unfortunately, the lack of details concerning the software aspects of many accelerators does not allow us to make a clear distinction for them. Local memories are a simple but effective solution to tackle wire-delay issues and limit transfers from the main memory through data reuse, improving efficiency, latency, and also performance [12,303]. These memories are usually small, but this is not a limiting factor, as load and store cycles can be effectively integrated in streaming and pipelining mechanisms with no stalls, overlapping computation and data transfers.

With respect to cache memories, local memories allow for a higher flexibility, since they can be managed according to the policies that best apply to the accelerated task, in both execution-driven and code-driven cases. In particular, the first kind is the most appropriate choice for accelerated tasks with a regular memory access and reuse pattern,1 while the second kind can be employed even in irregular cases. We can see in Table 7 that only 20% of accelerators that feature a cache memory do not rely also on a local memory: GraphP [50], Intel Xeon Phis [125,126], NEC SX-Aurora TSUBASAs [130], Qualcomm Hexagon DSPs [262], and Tesseract [279].

1 Nowatzki et al. claim that most "acceleratable algorithms" share this characteristic [10].

The table also highlights that 3D-stacked memories are a mature technology that can be employed in various cases (about one fifth of the total). JEDEC's HBM [53], Micron's HMC [48,49], and Intel's MCDRAM [52] (which is based on HMC) are all high-bandwidth memory technologies designed as a "memory cube", i.e., a stack of memory dies with one logic die on the bottom, connected by through-silicon vias and micro-bumps. These technologies are employed in PIM accelerators, with the computing logic implemented in the logic die, and to serve as high-performance on-board memories for PCI-e based accelerators.
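The following C++ sketch shows the double-buffered streaming pattern described above for software-managed local memories: one tile of the scratchpad is filled while the previously loaded tile is processed. The tile size is an illustrative assumption and the fill is modelled as a plain copy; on a real accelerator it would be an asynchronous DMA overlapping with the compute loop.

#include <cstddef>
#include <cstdio>
#include <vector>

constexpr std::size_t TILE = 256;   // illustrative scratchpad tile size

void fill_tile(const float* src, float* tile, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) tile[i] = src[i];    // "DMA" load
}

float process_tile(const float* tile, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += tile[i];       // compute
    return acc;
}

float streamed_sum(const std::vector<float>& data) {
    float scratch[2][TILE];          // double-buffered local memory
    float total = 0.0f;
    std::size_t tiles = data.size() / TILE;

    fill_tile(data.data(), scratch[0], TILE);                  // prologue
    for (std::size_t t = 0; t < tiles; ++t) {
        if (t + 1 < tiles)           // prefetch the next tile into the other buffer
            fill_tile(data.data() + (t + 1) * TILE, scratch[(t + 1) % 2], TILE);
        total += process_tile(scratch[t % 2], TILE);           // consume current
    }
    return total;
}

int main() {
    std::vector<float> data(4 * TILE, 1.0f);
    std::printf("sum = %.1f\n", streamed_sum(data));           // 1024.0 expected
    return 0;
}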

Table 7

Architectural aspects.
Columns: Accelerator; General-purpose resources, split into Memory (L1, L2, L3, DDR3, DDR4, LPDDR4, LPDDR4X) and Comp. engine (ARM CPU, x86 CPU, Scalar unit); Special-purpose resources, split into Memory (Local memory, GDDR6, GDDR6X, HBM, HBM2, MCDRAM, eDRAM, HMC-RAM), Comp. engine (Vector processor, Tensor core, Systolic array, Spatial array), Fixed-function (ML auxiliary-functions engine, Rendering pipeline engine, Physics engine, Cryptography engine, Compr./Decompr. engine, DB Operations engine, Alignment engine, AD/DA Converter), PL (Fine-grain, Coarse-grain), and Mem+comp. (In-SRAM logic, In-DRAM logic, Near-memory logic, ReRAM, STT-MRAM). A ✕ marks the components featured by each accelerator.
AHA371,

AHA372
AHA374,

AHA378
AHA604,
✕ ✕
AHA605
AMD Radeon
✕ ✕ ✕ ✕ ✕ ✕ ✕
RX 5000 Series
AMD Radeon
✕ ✕ ✕ ✕ ✕ ✕ ✕
RX 6000 Series
ARM Mali
✕ ✕ ✕ ✕ ✕
Bifrost
ARM Mali
✕ ✕ ✕ ✕ ✕
Valhall
Apple A13/A14

Bionic Neural – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –
Enginea
BioSEAL ✕ ✕
Baidu Kunlun b
✕ ✕ ✕ ✕
K200
Cambricon-X ✕ ✕ ✕
CASCADE ✕ ✕c ✕ ✕ ✕
Cerebras WSE ✕ ✕ ✕
Coral USB
✕ ✕ ✕
Accelerator
CSRAM ✕ ✕ ✕
DaDianNao
✕ ✕ ✕ ✕ ✕
chip
Darwin ✕ ✕ ✕



Darwin-WGA ✕ ✕
DianNao ✕ ✕ ✕
DIMA Inference
✕ ✕ ✕
Processor
DIMA-CNN ✕ ✕ ✕ ✕
DRISA ✕ ✕ ✕
DUAL ✕ ✕ ✕ ✕
Eyeriss ✕ ✕ ✕
Eyeriss v2 ✕ ✕ ✕
FlexFlow ✕ ✕ ✕
FloatPIM ✕ ✕ ✕ ✕
FPSA ✕ ✕ ✕ ✕ ✕
GenCache ✕ ✕ ✕ ✕
Google Pixel
✕ ✕
Visual Core
Google TPU
✕ ✕ ✕ ✕
v2, v3
Graphcore
✕ ✕ ✕
IPU-M2000
GraphH ✕ ✕ ✕ ✕
Graphicionado ✕ ✕
GraphP ✕ ✕ ✕ ✕
GraphQ ✕ ✕ ✕ ✕ ✕
GraphR ✕ ✕ ✕
GRIM-Filter ✕ ✕ ✕ ✕
GroqCard ✕ ✕ ✕ ✕
Hailo-8 ✕ ✕
Heterogeneous-
✕ ✕ ✕ ✕ ✕ ✕
PIM
HReA ✕ ✕ ✕ ✕ ✕
HRL ✕ ✕ ✕ ✕ ✕

Huawei Atlas 200


✕ ✕ ✕ ✕ ✕d
AI, 300I, 300I Pro
Huawei Atlas
✕ ✕ ✕ ✕ ✕ ✕d
300T
Huawei Kirin
9000x, 990x ✕ ✕ ✕ ✕d
NPU
IMP ✕ ✕ ✕
Intel FPGA 10
✕ ✕ ✕
Series
Intel FPGA
✕ ✕ ✕
F-Series
Intel FPGA V
✕ ✕
Series



Intel Graph. Tech.
✕ ✕ ✕ ✕ ✕ ✕
Gen11, Xe-LP
Intel Habana
✕ ✕ ✕
Labs HL-100
Intel Habana
✕ ✕ ✕ ✕
Labs HL-20x
Intel Nervana
NNP-I 1100, ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕
1300
Intel Nervana NNP-T
✕ ✕ ✕ ✕
1300, 1400
Intel Xeon Phi
Knights Landing, ✕ ✕ ✕ ✕ ✕ ✕
Mill
ISAAC ✕ ✕ ✕ ✕ ✕
LerGAN ✕ ✕ ✕ ✕
Micron
✕ ✕ ✕
Automata Proc.
Microsoft
Project ✕ ✕
Catapult
Mixed-signal ✕ ✕ ✕
Multibit ✕ ✕
NAND-Net ✕ ✕ ✕ ✕
NDA ✕ ✕ ✕ ✕
NEC SX-Aurora
TSUBASA V.E. ✕ ✕ ✕ ✕ ✕
Type10, Type20
NEST ✕ ✕ ✕ ✕

Neural Cache ✕ ✕ ✕ ✕
Neurocube ✕ ✕ ✕ ✕ ✕ ✕
NonVolCIM ✕ ✕ ✕
NP-CGRA ✕ ✕ ✕
NVIDIA
GeForce ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕
RTX 20xx
NVIDIA
GeForce ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕
RTX 30xx
Origami ✕ ✕
PipeLayer ✕ ✕ ✕ ✕
Plasticine ✕ ✕ ✕



PLiM ✕ ✕
PracticalNDP ✕ ✕ ✕ ✕ ✕ ✕
PRIME ✕ ✕ ✕ ✕
PROMISE ✕ ✕ ✕
PuDianNao ✕ ✕ ✕
PUMA ✕ ✕ ✕ ✕ ✕
PX-CGRA ✕ ✕ ✕
Q100 ✕
Qualcomm Hex.
✕ ✕ ✕
698, 780 DSP
RADAR ✕ ✕ ✕ ✕
RAPID ✕ ✕ ✕ ✕
REMUS ✕ ✕ ✕ ✕
RobustInMem ✕ ✕ ✕ ✕
Samsung
Exynos 9825, ✕ ✕ ✕
990 NPU
Sandwich-RAM ✕ ✕ ✕
ShiDianNao ✕ ✕ ✕
Softbrain ✕ ✕ ✕
SparseReRAM ✕ ✕ ✕ ✕
Spin-transfer ✕ ✕
TensorNode ✕ ✕ ✕
Tesla Full Self

Driving ✕ ✕ ✕
Computer NPU
Tesseract ✕ ✕ ✕ ✕ ✕ ✕
TETRIS ✕ ✕ ✕ ✕ ✕ ✕
Time ✕ ✕ ✕ ✕
Untether
✕ ✕
TsunAImi
UPMEM PIM ✕ ✕ ✕
X-CGRA ✕ ✕ ✕
Xilinx FPGA 7
✕ ✕ ✕
Series
Xilinx FPGA
Ultrascale+ ✕ ✕ ✕



Series
Xilinx Versal
✕ ✕ ✕ ✕ ✕ ✕
ACAP
YodaNN ✕ ✕
a No architectural details available.
b Cores with tensor and vector units.
c Systolic array of ReRAM arrays.
d Cores with tensor, vector, and scalar units.
About half of the accelerators rely on special-purpose compute resources. These are roughly equally represented across accelerators, but systolic arrays and vector processors are the most abundant. We already discussed the role of systolic arrays in ML accelerators. As for vector processors, they can be found in three variations: classic vector processors with deep pipelines and a variable number of lanes (e.g., NEC SX-Aurora TSUBASA [130]), SIMD-like vector processors with a fixed number of lanes (e.g., Baidu Kunlun K200 [168], Qualcomm Hexagon DSPs [262], Intel Habana Labs [227,228]), and massively multi-threaded vector cores (e.g., AMD [74], NVIDIA [135,138], Intel [105,109] GPUs). Vector processors provide a level of parallelism in the order of tens to hundreds of primitive elements. Some accelerator designs, like Manycore and GPU, replicate these vector cores to achieve an even higher level of parallelism. They are also employed in PUMA [256], a PIM accelerator that integrates a ReRAM crossbar and a SIMD unit to support various ML workloads.

According to our classification, spatial arrays can be found in Spatial accelerators (programmable PEs) and CGRAs (reconfigurable PEs). However, there are three notable examples: TETRIS [280], a PIM accelerator that integrates an NN engine organized as a spatial array of PEs in each vault of an HMC stack; Xilinx Versal ACAP [152], an FPGA augmented with two general-purpose cores and the so-called Intelligent Engines, arrays of interconnected VLIW and SIMD engines; and GroqCard [208]. The last one has an original architecture that the authors call Tensor Streaming Processor (TSP): the accelerator's core is a heterogeneous spatial array, with its elements tiled "per role". One tile is dedicated to instruction decode and dispatch, and instructions flow northward towards the functional tiles [210].

The last special resource, tensor cores, is employed in various accelerators as a basic building block in more complex structures, alongside ALUs, simple PEs, scalar, and vector units. They are present in Manycore accelerators (e.g., Baidu Kunlun K200 [168], Intel Nervana [113,118], DaVinci-powered Huawei accelerators [217,218,224]), GPUs (Turing and Ampere architectures from NVIDIA [135,138]), Spatial accelerators (Cerebras WSE [172]), and PIM accelerators (Neurocube [244], TensorNode [276]).

In today's accelerators, Programmable Logic is no longer limited to FPGAs. Depending on the case, it may play a primary or an ancillary role. It plays a primary role in FPGAs (fine-grain PL) and in CGRAs (coarse-grain PL), where it is responsible for the computing elements and the interconnection between them. It plays an ancillary role in DRISA [186], a PIM accelerator with reconfigurable Boolean logic operations performed in the memory cells, and in the aforementioned FPSA [192] and Micron Automata Processor [231].

The abundance of PIM accelerators in Table 5 implies an abundance of "Memory and Computing" components in Table 7, which detail the way processing-in-memory is achieved. A surprising datum that emerges is that the most adopted solution for PIM is ReRAM, which is present in about 39% of PIM accelerators. Although it is an emerging technology, there are many researchers proposing its use in accelerators. All of them are academic proposals, as there are still challenges related to ReRAM employment, as we will analyze in Section 5. After ReRAM, the most common approach is the near-memory one, adopted by about 31% of PIM accelerators. 60% of these rely on HMC and implement their acceleration logic in the logic die. About one fifth of PIM accelerators propose in-SRAM logic, which could be used to augment local memories and caches with computation capabilities. Cache augmentation, in particular, may even involve general-purpose processors, leading to a major leap away from von Neumann. Finally, there are only four accelerators that propose the use of in-DRAM logic and one that uses STT-MRAM.

Finally, fixed-function components are employed in more than 60% of accelerators, generally to aid programmable and reconfigurable components in task acceleration. Since the majority of accelerators are dedicated to ML workloads, the most common fixed-function components are dedicated to auxiliary ML-related functions, such as activation functions (e.g., sigmoid, tanh, ReLU). Since these functions are invoked many times in ML workloads, it is reasonable to provide them as hardware blocks even in programmable accelerators that could calculate them as sequences of arithmetic instructions.

The second most prominent fixed-function blocks are ADC/DAC converters. CSRAM [179], an in-SRAM accelerator, and ReRAM-based ones are equipped with them because they perform multiplications in the analog domain. Some ML-oriented in-SRAM accelerators perform mixed-signal multiplications between weights and data (DIMA-CNN [185], Mixed-signal [238], PROMISE [254], and RobustInMem [268]). Finally, Xilinx FPGAs [142,147,152] feature them to convert output signals into analog ones or input signals into digital ones.

3.2.4. Software aspects

Table 8 summarizes the software aspects of the identified accelerators. As for Host coupling, there are various accelerators from Academia that do not disclose this information. We make assumptions based on other characteristics of the accelerators.

Programming Layer. In this work, we confirm that "libraries are a universal 'programming model' for all kinds of accelerators" [304]. Independently of the particular approach chosen by manufacturers, it is always possible to wrap the CPU-accelerator communication logic into library calls for the most common programming languages. Moreover, there can be various layers of libraries that offer an increasing level of abstraction, with wrappers of driver functions or even inline assembly with ad-hoc instructions [10,254,275] at the bottom.

Around 41% of the accelerators in the table can be programmed with an ad-hoc high-level language. For accelerators with a high flexibility, like those that operate in the Data-parallel, Dataflow, or General-purpose domains, this is the design choice that best supports programmer productivity. However, it is also the most complex, as it needs at least a compiler/interpreter and a debugger. For this reason, many designers prefer simpler solutions such as library wrappers, which generally give an adequate level of abstraction for accelerators with limited flexibility.

On the opposite side of the spectrum, there are HDL and assembly. The former is used as a tool to configure the programmable logic of reconfigurable accelerators, mostly FPGAs' fine-grain logic. Apart from HRL, which adopts a design flow similar to FPGAs [214], most CGRAs choose lightweight procedures to reconfigure their programmable logic. Thanks to this and their shorter reconfiguration time, the reconfiguration process can be performed at runtime and included in a workflow that makes CGRAs eligible for temporal computations.

Assembly is supported in a few cases (about 9%). Most of them are Data-parallel accelerators: AMD and NVIDIA GPUs [74,135,138] (albeit as a virtual ISA that is later translated to the target hardware ISA, generally not disclosed by vendors), NEC SX-Aurora TSUBASA [130], Baidu Kunlun K200 [168], Intel Xeon Phis [125,126], and PLiM [251]. There are also two accelerators for ML inference (i.e., PROMISE [254] and Spin-transfer [275]) and one for Dataflow (i.e., Softbrain [10]).

Ecosystem. The Ecosystem category best highlights the accelerators' heterogeneity: a plethora of frameworks, compilers, languages, and libraries populate the table. This variety involves mainly industrial accelerators, as for half of the total, all from Academia, we do not have any information. Overall, TensorFlow [56], an ML-oriented library in Python, is the most represented element, as it is supported by 14 accelerators. It is the de-facto standard for ML, albeit other solutions like PyTorch [305], ONNX [306], PaddlePaddle [307], and TensorFlow Lite [308] are also common. It is not even limited to ML workloads, as the authors of IMP [226] chose it as the programming framework for their Data-parallel PIM accelerator, observing that "TensorFlow's programming semantics is a perfect marriage of data-flow and vector-processing that can be applied to more general applications" [226].

The authors of Heterogeneous-PIM [212] make a complementary choice: they extend the OpenCL [57,58,309] programming model, which is intended for data-parallel tasks, to express NN training operations for their accelerator. Table 8 shows that OpenCL is supported across very different accelerators, including GPUs, Manycores, the


Table 8
Accelerators’ software aspects [310–354].
Columns: Accelerator; Programming layer (libraries, high-level languages, assembly, HDL); Ecosystem; Granularity (wired task, function-level, application-level). A ✕ marks the options supported by each accelerator.
AHA371, AHA372, AHA374, AHA378 ✕ gzip [310], ZLIB [311], Java API, Hadoop [312] ✕
AHA604, AHA605 ✕ gzip [310], ZLIB [311], OpenSSL [313] ✕
OpenCL [57,58,309], ROCm [314], Sycl [59], Kokkos [355],
OpenMP4+ [60], OpenACC [315], SkelCL [356], StarPU [316],
AMD Radeon RX 5000 Series, 6000 Series ✕ ✕ ✕ ✕
Halide [317], OmpSs [318], OpenGL [357], Vulkan [63],
DirectX [358], Metal [319]
ARM Mali Bifrost, Valhall ✕ ✕ OpenCL [57,58,309], Sycl [59], OpenGL [357], Vulkan [63] ✕
Apple A13 Bionic Neural Engine,
✕ TensorFlow Lite [308], PyTorch [305], Core ML [320] ✕
A14 Bionic Neural Engine
XTCL, XTDK [167], TensorFlow [56], PaddlePaddle [307],
Baidu Kunlun K200 ✕ ✕ ✕ ✕
PyTorch [305]
BioSEAL ✕ ✕ ✕
Cambricon-X ✕ Caffe [64] ✕
CASCADE ✕ ✕
Cerebras WSE ✕ ✕ Cerebras SDK [321], TensorFlow [56], PyTorch [305] ✕
Coral USB Accelerator ✕ TensorFlow Lite [308], PyCoral [73] ✕
CSRAM ✕ ✕
DaDianNao chip ✕ ✕
Darwin ✕ ✕
Darwin-WGA ✕ ✕
DianNao ✕ ✕
DIMA Inference Processor ✕ ✕
DIMA-CNN ✕ ✕
DRISA ✕ ✕ ✕
DUAL ✕ ✕
Eyeriss ✕ Caffe [64] ✕
Eyeriss v2 ✕ ✕
FlexFlow ✕ ✕
FloatPIM ✕ ✕
FPSA ✕ ✕ NN compiler [322] ✕
GenCache ✕ ✕
Google Pixel Visual Core ✕ ✕ Halide [317], TensorFlow [56], Android Camera API [323] ✕
Google TPU v2, v3 ✕ TensorFlow [56], scikit-learn [324], XGBoost [325], Keras [359] ✕
Poplar SDK [326], TensorFlow [56], ONNX [306],
Graphcore IPU-M2000 ✕ ✕
PaddlePaddle [307], PyTorch [305], Keras [359]
GraphH ✕ ✕ ✕
Graphicionado ✕ ✕ GraphMath [327] ✕
GraphP ✕ ✕ ✕
GraphQ ✕ ✕ ✕
GraphR ✕ ✕
GRIM-Filter ✕ ✕
GroqCard ✕ TensorFlow [56] ✕
Hailo-8 ✕ TensorFlow [56], ONNX [306], AI SDK [328] ✕
Heterogeneous-PIM ✕ ✕ OpenCL [57,58,309] ✕
HReA ✕ ✕
HRL ✕ ✕ Verilog [61] ✕
CANN [329], MindSpore [330], TensorFlow [56],
Huawei Atlas 200 AI, 300I, 300I Pro, 300T ✕ ✕ ✕
PyTorch [305], PaddlePaddle [307], MindX SDK [331]
HiAI DDK [72], CANN [329], TensorFlow Lite [308], Android
Huawei Kirin 990x NPU, 9000x NPU ✕ ✕ ✕
NNAPI [332], MindSpore [330], PaddlePaddle [307]
IMP ✕ TensorFlow [56] ✕
Intel Quartus Prime [71], DSP Builder [333], Intel HLS [334],
Intel FPGA V Series, 10 Series, F-Series ✕ ✕ ✕ ✕
Vivado Design Suite [335], OpenCL [57,58,309]
OpenCL [57,58,309], Sycl [59], OpenMP4+[60],
Intel Graphics Technology Gen11, Xe-LP ✕ ✕ OpenACC [315], SkelCL [356], StarPU [316], Halide [317], ✕
OmpSs [318], OpenGL [357], Vulkan [63], DirectX [336]
TensorFlow [56], PyTorch [305], ONNX [306], MXNet [337],
Intel Habana Labs HL-100 ✕ ✕ ✕
Glow ML Compiler [338], Synapse AI Suite [227]
Intel Habana Labs HL-20x ✕ ✕ TensorFlow [56], SynapseAI Suite [228] ✕
TensorFlow [56], OpenVINO [339], PyTorch [305],
Intel Nervana NNP-I 1100, 1300 ✕ ✕ ✕
nGraph [340,341], ONNX [306]
TensorFlow [56], PaddlePaddle [307], PyTorch [305],
Intel Nervana NNP-T 1300, 1400 ✕ ✕ ✕
nGraph [340,341]
Standard x86 tools, OpenCL [57,58,309], Sycl [59],
Intel Xeon Phi Knights Landing,
✕ ✕ ✕ OpenMP [60], OpenACC [315], SkelCL [356], StarPU [316], ✕ ✕
Knights Mill
Halide [317], OmpSs [318]
ISAAC ✕ ✕
LerGAN ✕ ✕ ✕
Micron Automata Processor ✕ ✕ ANML [234], AP SDK, RAPID [342] ✕
Microsoft Project Catapult ✕ ✕ Verilog [61], VHDL [62,343] ✕
Mixed-signal ✕ ✕
Multibit ✕ ✕
NAND-Net ✕ ✕
NDA ✕ ✕
NEC SX-Aurora TSUBASA Vector Engine
✕ ✕ ✕ VE offloading C-API, NEC SDK [344] ✕
Type10, Type20
NEST ✕ ✕
Neural Cache ✕ ✕
Neurocube ✕ ✕
NonVolCIM ✕ ✕
NP-CGRA ✕ ✕
CUDA [65,345], OpenCL [57,58,309], Sycl [59],
PHAST [360], Kokkos [355], OpenMP4+[60], OpenACC [315],
NVIDIA GeForce RTX 20xx, 30xx ✕ ✕ ✕ SkelCL [356], StarPU [316], Halide [317], OmpSs [318], ✕
Matlab [346], OpenGL [357], Vulkan [63], DirectX [347],
Metal [319]
Origami –a –a –a –a ✕
PipeLayer ✕ ✕
Plasticine ✕ ✕ Delite Hardware Definition Language [348] ✕
PLiM ✕ ✕ ✕
PracticalNDP ✕ ✕ Phoenix++[349] ✕
PRIME ✕ ✕ ✕
PROMISE ✕ ✕ ✕ MXNet [337], Flux [350,351] ✕
PuDianNao ✕ ✕
PUMA ✕ Caffe [64], PyTorch [305], MXNet [337], ONNX [306] ✕
PX-CGRA ✕ ✕ Matlab [287] ✕
Q100 ✕ ✕ SQL ✕
Hexagon DSP SDK [66], Android NN API [332], TensorFlow
Qualcomm Hexagon 698 DSP, 780 DSP ✕ ✕ Lite [308], Qualcomm Neural Processing SDK [67], ✕
ONNX [306], PyTorch [305]
RADAR ✕ ✕
RAPID ✕ ✕
REMUS ✕ ✕ ✕
RobustInMem ✕ ✕
Android NN API [332], TensorFlow Lite [308],
Samsung Exynos 9825 NPU, 990 NPU ✕ ✕
Samsung Neural SDK [68]
Sandwich-RAM ✕ ✕
ShiDianNao ✕ ✕
Softbrain ✕ ✕ ✕ ✕
SparseReRAM ✕ ✕
Spin-transfer ✕ ✕ ✕
TensorNode ✕ ✕
Tesla Full Self Driving Computer NPU ✕ NN compiler ✕
TETRIS ✕ ✕
Tesseract ✕ ✕ ✕
Time ✕ ✕
Untether TsunAImi ✕ TensorFlow [56], PyTorch [305], imAIgine SDK [282] ✕
UPMEM PIM ✕ ✕ UPMEM SDK [69] ✕
X-CGRA ✕ ✕ Matlab [287] ✕
Vitis Unified Software Platform [70], Vivado Design
Xilinx FPGA 7 Series, Ultrascale+ Series ✕ ✕ ✕ Suite [352], OpenCL [57,58,309], Verilog [61,353], ✕
VHDL [62,343], Matlab [354]
Vitis Unified Software Platform [70], Vivado Design
Xilinx Versal ACAP ✕ ✕ ✕ Suite [352], OpenCL [57,58,309], Verilog [61,353], ✕ ✕
VHDL [62,343]
YodaNN ✕ ✕
a It depends on the attached FPGA.


Fig. 8. Power consumption (Watt) vs aggregated throughput (TOPS) for data-parallel accelerators. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

aforementioned PIM accelerator, and even FPGAs. In the last case, it is used as a high-level synthesis tool, as the kernels expressed in this language are compiled into a binary that can be used to reconfigure the programmable logic of these accelerators.

Other more traditional options are also present in the table, like Verilog [61] and VHDL [62]. Finally, the type of accelerator with the richest ecosystem is by far the GPU. Graphics-oriented APIs like OpenGL [357], Vulkan [63], and DirectX [358] are flanked by purely data-parallel solutions like the aforementioned OpenCL, but also Sycl [59] and OpenMP [60].

Granularity. The granularity category shows a strong correlation with the programming layer. Accelerators that support high-level programming languages have at least a function-level granularity. However, the opposite is not true, as function-level granularity can also be achieved by means of reconfigurable logic, or with libraries only, as five non-reconfigurable accelerators testify. In these cases, function-level granularity is achieved by specifying a callback that is compiled JIT and executed on the accelerator, e.g., TensorFlow's custom activation functions. Conversely, reconfigurable accelerators can employ HDL and libraries, the latter mainly to write configuration data in CGRAs at runtime.

Wired task granularity is the most common, with around 52% of accelerators supporting it. It can refer to simple operations, equivalent to common instructions such as LOAD or ADD, but also to more complex ones that trigger a non-trivial computation on the accelerator. So, the remaining 48% of accelerators support at least function-level granularity. This datum highlights that accelerators, despite their domain-specific nature, are generally flexible rather than inflexible devices. This may sound surprising: accelerators, after all, pursue high throughput and efficiency, which find their maximum realization in ASICs. However, being inflexible may limit too much, in the majority of cases, the set of acceleratable workloads, which is hardly convenient from the users' perspective outside of very selected and critical contexts.

Finally, three accelerators support application-level granularity: Xilinx Versal ACAP [152], accelerators in the Xeon Phi family [125,126], and NEC SX-Aurora TSUBASA Vector Engines [130]. The first two can be used in a mode that Intel calls native mode (as opposed to offload mode). In this mode, they can operate as stand-alone nodes and also take care of the prerogatives commonly reserved to the host. The last one achieves application-level granularity thanks to the "reverse offloading" mechanism described before.

4. Performance study

Due to the variety and heterogeneity of the presented accelerators, we cannot define a performance metric or a set of metrics that are descriptive of all of them. Performance can refer to throughput, latency, memory bandwidth, efficiency, etc. Some of them are not available for every accelerator, whether because they do not apply or because they are not disclosed by vendors. However, a common metric that can be defined for basically every accelerator is throughput, but also in this case, there is no homogeneity in what it refers to. First, throughput can refer to the number of operations performed by all the units in the accelerator (ALUs, FPUs, PEs, etc.) per second, or to the number of "artifacts" produced in output per second (images per second, compressed bytes per second, etc.). We define the former aggregated throughput, and the latter output throughput. Second, even interpreting throughput the same way, it can be expressed with different units of measure: it can be given in FP32 FLOPS, FP64 FLOPS, Gbps, images per second, and so on, depending on the domain in which the accelerator operates, the data types the accelerator supports, and the taste of the vendor in terms of what characteristic they prefer to highlight.

Fig. 9. Maximum performance achieved until each year for each type of data-parallel accelerator.

and compared with each other by multiplying the metric by the number of bytes of the processed data type, obtaining a Bps (Bytes per second) value. Unfortunately, output throughput metrics, where available, are highly heterogeneous and cannot be normalized, even for accelerators in the same domain. Therefore, in this paper, we focus our analysis on aggregated throughput expressed in Bps for accelerators in the ML inference, ML training, and Data-parallel domains. We relate this performance to the power dissipated at full utilization, for which we take the Thermal Design Power (TDP) as an estimation. This way, we frame accelerators from an efficiency standpoint, intended as the achievable throughput per power unit or, equivalently, the number of bytes processed per energy unit.

Some accelerators categorized so far are presented as families, rather than single models. Even in the same family, different models can perform very differently. Therefore, in the figures in this section we show the achieved performance of accelerator models. The association between families and models is displayed in Table 2.

4.1. Data-parallel

Fig. 8 shows the relation between power consumption and normalized aggregated throughput (in the following: throughput) in data-parallel accelerators. These include GPUs, Manycores, Vectors, one PIM, and one CGRA. We can identify two main groups of accelerators: laptop and desktop/server accelerators, with power in the ranges 10–50 W and 50–500 W, respectively. All accelerators can be easily assigned to one group or the other, with the only exception of Plasticine [250], at the boundary with 43 W. The figure clearly displays the strong bond that exists between power consumption and achievable throughput, with accelerators in the lower (higher) power group generally achieving lower (higher) throughput.

All the GPUs except Intel's UHD730 [110] and UHD770 [111] have higher FP16 and FP32 throughput than FP64. Generally, FP64 units are very few: for instance, on NVIDIA Ampere GPUs, each Streaming Multiprocessor features 128 FP32 cores and only 2 FP64 cores [138]. There are two complementary reasons behind this design choice. On the one hand, for most workloads commonly executed on GPUs, 64-bit precision is not needed (e.g., graphics). On the other, the need for FP16 support is increasing: according to a significant body of work, in ML workloads, precision can be safely reduced with small accuracy losses and high gains in performance, memory bandwidth, and storage cost [21]. Many GPU families have the same FP16 and FP32 throughput (e.g., NVIDIA RTX 20xx, AMD RX5xxx, AMD RX6xxx): their FPUs support packed data, so they are able to perform one 32-bit operation or two 16-bit operations together [74,135]. NVIDIA made a different design decision in its most recent generation (30xx): FP32 throughput is twice the FP16 throughput — which means that the corresponding non-normalized FLOPS values are the same.

The ongoing investment in small data-types (and the reduction in FP64 support) is evident also in the Intel Xeon Phi family. Knights Landing models [122–124] hit the market in 2016 with the same FP32 and FP64 throughput and no native support for FP16. In the following generation, which debuted in 2017, Knights Mill accelerators [127–129] had the same FP16 and FP32 throughput, and 1/4 FP64 throughput. At the time of writing, it is rumored that the upcoming Intel Alchemist GPUs will skip FP64 support altogether [361].

NEC SX-Aurora TSUBASA models [130], which do not explicitly target ML workloads, still in 2019 have no native support for FP16 and have the same FP32 and FP64 throughput. In this case, 64-bit entries from vector registers can be treated as two packed 32-bit floats, which are processed together by double-precision units.

Baidu Kunlun K200 [168] has Data-parallel capabilities, but it is advertised mostly as an ML-oriented accelerator. It follows the general tendency towards reduced precision: maximum throughput is achieved with INT8 operations (256 TBps), followed by FP16 (128 TBps), and FP32 (64 TBps).

In Fig. 8, there are only two academic proposals: IMP [226], a PIM accelerator, and Plasticine [250], a CGRA. Considering that academic proposals are generally prototypes, it is even more remarkable that Plasticine achieves about the same FP32 throughput as an RTX3060 GPU with about 25% of its power consumption – i.e., it is 4× as efficient. Since it is the only CGRA in the figure, we cannot generalize and affirm that this is due to the high efficiency of this type of accelerator compared to GPUs. However, such a claim would be in line with other studies that highlight how CGRAs generally achieve better efficiency than other architectures (except ASICs) [10,18,214,241].

Overall, Baidu Kunlun K200 [168] achieves the maximum INT8 throughput (256.00 TBps) and FP16 throughput (128.00 TBps), and NVIDIA RTX3090 [138] the maximum FP32 throughput (142.40 TBps). As for efficiency, Baidu Kunlun K200 [168] achieves the maximum INT8 efficiency (1.707 TBps/W) and FP16 efficiency (0.852 TBps/W), and Plasticine [250] the maximum FP32 efficiency (1.144 TBps/W).

Fig. 9 shows the throughput and efficiency trends in data-parallel accelerators, grouped by type. For each year, it shows the maximum performance achieved until that year. In formulas, the value $V$ assumed by the curve associated to data-parallel accelerators of type $T$ in the year $y_0$ is calculated as:

$$V_T(y_0) = \max_{\substack{A \in DP_{T,y} \\ y \le y_0}} Perf_A$$

where $Perf_A$ is the performance achieved by accelerator $A$, and $DP_{T,y}$ is the set of data-parallel accelerators of type $T$ proposed in the year $y$. $Perf$ may refer to throughput (Fig. 9(a)) or efficiency (Fig. 9(b)).
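To make the construction of these trend curves concrete, the short sketch below computes $V_T(y_0)$ as a per-type running maximum. It is only an illustration of the formula above: the record layout, the helper names, and the sample values are ours, not data or tooling from the surveyed works.

```go
package main

import "fmt"

// Accel is a hypothetical record for one accelerator data point.
type Accel struct {
	Name string
	Type string  // e.g., "GPU", "Manycore", "Vector", "PIM", "CGRA"
	Year int
	Perf float64 // normalized throughput (TBps) or efficiency (TBps/W)
}

// trend returns, for each type T and each year y0 in [first, last],
// V_T(y0): the maximum Perf over accelerators of type T proposed in a year y <= y0.
func trend(accels []Accel, first, last int) map[string][]float64 {
	curves := make(map[string][]float64)
	for _, a := range accels {
		if _, ok := curves[a.Type]; !ok {
			curves[a.Type] = make([]float64, last-first+1)
		}
	}
	for t, curve := range curves {
		best := 0.0
		for y := first; y <= last; y++ {
			for _, a := range accels {
				if a.Type == t && a.Year == y && a.Perf > best {
					best = a.Perf
				}
			}
			curve[y-first] = best // running maximum up to year y
		}
	}
	return curves
}

func main() {
	// Illustrative values only.
	data := []Accel{
		{"Plasticine", "CGRA", 2017, 1.144},
		{"RTX3090", "GPU", 2020, 0.40},
		{"Kunlun K200", "Manycore", 2021, 1.707},
	}
	fmt.Println(trend(data, 2017, 2021))
}
```

Fed with the complete per-model dataset of Table 2, the same loop would produce curves of the kind plotted in Fig. 9.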
Fig. 9(a) shows that Manycore accelerators achieve the highest throughput for data-parallel workloads. GPUs were the type with the highest throughput from 2018 to 2020. In 2021, the introduction of Baidu Kunlun K200 [168] overturned the situation, surpassing the best GPU by 79.77% and increasing the maximum throughput achieved by Manycore accelerators by 4.34× with respect to the previous best performing one, Intel Xeon Phi KM7295 [129]. The superiority of Manycores and GPUs – which are closely related, as already explained

Fig. 10. Power consumption (Watt) vs aggregated throughput (TOPS) for ML accelerators. (For interpretation of the references to color in this figure legend, the reader is referred
to the web version of this article.)

– seems to suggest that this is the way to design high-throughput data-parallel accelerators.

On the efficiency side, Fig. 9(b) highlights a different situation: Plasticine's introduction in 2017 [250] improved the maximum efficiency by about one order of magnitude. In the period 2017–2020, two generations of GPUs were introduced in the market. The maximum efficiency doubled in the last one, but they achieved at most 0.39× Plasticine's efficiency. Since these GPUs are destined to laptops, data-centers, and supercomputers, we think that their designers held efficiency in high consideration – albeit less than throughput. In 2021, Baidu Kunlun K200 [168] was introduced and beat Plasticine's efficiency by 49%. In our opinion, the fact that it took 4 years for an industrial manycore accelerator to score a better efficiency than an academic CGRA can be read as a proof of the superiority of the CGRA approach in terms of efficiency and of how promising it can potentially be.

In both figures, the PIM type achieves the worst throughput and efficiency. In our opinion, this is due to the fact that there is only one PIM, IMP [226], which may not be enough to represent its category. We will show in the next Subsection that PIM accelerators too may be highly efficient.

4.2. Machine learning

Fig. 10 shows the relation between power consumption and throughput in ML accelerators. There are GPUs, Manycores, Systolic, Spatial, PIM accelerators, and one FPGA. Mixed-signal [238] and YodaNN [288] are displayed in blue because, since they target BNNs, they replace MAC units with complement units and multiplexers, and their throughput is calculated based on simpler operations than multiplications. For this reason, we display them in the figure, but avoid any comparison with other accelerators, as it would be unfair.

In this Figure, as many as five power-dependent groups can be identified:

• Very-low: less than 0.2 W;
• Mobile: range 0.2–10 W;
• Laptop: range 10–50 W;
• Desktop/server: range 50–500 W;
• High-end: more than 500 W.

Again, each of them can be easily assigned to a group, with some ambiguity for Atlas 200 AI [215] and Intel Nervana NNP-I 1100 [116], with 9.5 W and 12 W, respectively. It is interesting to note that laptop and desktop/server accelerators achieve higher throughput than their data-parallel counterparts, displayed in gray in the background of the figure. Even DaDianNao [180], proposed in 2014, achieves higher throughput than the most recent laptop GPUs, basically with the same power budget. This happens because ML-oriented accelerators are tailored for a narrower domain than data-parallel ones. Their designers can pursue specialization more aggressively, reducing flexibility and replacing control-flow units with purely functional ones.

As expected, the tendency of lowering data-type precision to seek throughput and efficiency is even stronger in accelerators that explicitly target ML workloads. In this case, the native data-types supported are as low as single bits (INT1), the majority of them employ at most 16-bit data-types, and the few with 32-bit data-types (DaDianNao [180], Baidu Kunlun K200 [168], and Graphcore M2000 [201]) achieve less than 1 TBps/W efficiency. No accelerator manipulates FP64 data-types. INT8 is the data-type preferred by many recent accelerators (2018 and later) in the efficiency range 1 TBps/W–10 TBps/W. 16-bit fixed-point, conversely, has no support in accelerators proposed after 2016.

The only accelerators displayed in both Figs. 8 and 10 are Baidu Kunlun K200 [168], described before, and NVIDIA GPUs. The latter, despite being Data-parallel accelerators, are widespread also for performing ML tasks, even at data-center scale. In Fig. 10, we display the throughput achieved by their Tensor Cores [135,138], advertised as

Fig. 11. Maximum performance achieved until each year for each type of Machine Learning accelerator.

"specialized execution units designed specifically for performing the tensor/matrix operations that are the core compute function used in Deep Learning" [135]. In both the 20xx and 30xx generations, they are able to perform INT4, INT8, and FP16 calculations with the same throughput. We think this is due, also in this case, to the ability of using packed data.

Intel Stratix 10 NX FPGA [362] is the only FPGA in the figures. FPGAs' achieved throughput and power consumption greatly depend on the configuration adopted, and a general discussion cannot be made regardless of it. However, this FPGA features AI-optimized compute blocks and Intel discloses all the relevant data (data-type, aggregated throughput, and efficiency).

The maximum throughputs achieved are: Untether TsunAImi [282] (2.00 PBps, INT8), Graphcore IPU-M2000 [201] (2.00 PBps, FP16 and 1.00 PBps, FP32), Geforce RTX3090 Tensor Core [141] (284.00 TBps, INT4), Intel Nervana NNP-T 1400 [121] (216.00 TBps, BF16), PUMA [256] (104.62 TBps, FixP16), Eyeriss V2 [189] (356 GBps, FixP8), Origami [247] (294 GBps, FixP12), Multibit [239] (7.00 GBps, INT2).

The maximum efficiencies achieved are: Multibit [239] (5.533 TBps/W, INT2), Untether TsunAImi [282] (5.00 TBps/W, INT8), PuDianNao [255] (3.544 TBps/W, FixP16), Atlas 200 AI [215] (2.316 TBps/W, FP16), FloatPIM [191] (1.637 TBps/W, BF16), Intel Stratix 10 NX FPGA [362] (1.00 TBps/W, INT4), Graphcore IPU-M2000 [201] (909 GBps/W, FP32), Origami [247] (655 GBps/W, FixP12), Eyeriss V2 [189] (511 GBps/W, FixP8).

Fig. 11 shows the throughput and efficiency trends in ML accelerators, grouped by type. It is analogous to Fig. 9, but for ML accelerators. The value $V$ assumed by the curve associated to ML accelerators of type $T$ in the year $y_0$ is calculated as:

$$V_T(y_0) = \max_{\substack{A \in ML_{T,y} \\ y \le y_0}} Perf_A$$

with $Perf_A$ having the same meaning as before and $ML_{T,y}$ denoting the set of ML accelerators of type $T$ proposed in the year $y$.

Fig. 11(a) shows that, as of today, the best throughput is achieved by Spatial accelerators. This has been true since 2020, when Untether TsunAImi [282] was proposed and improved the best throughput by 3.57×. Until the year before, the best performing accelerator was the Manycore Huawei Atlas 300T [220], but the year before that the record was contended between PIMs and GPUs, with Systolic accelerators just behind. This turnover seems to suggest that, in the case of ML accelerators, there is not a neat distinction between the achievable capabilities of the various types of accelerators. More than one candidate seems to be a good fit to deliver high throughput.

Fig. 11(b) shows the efficiency trend. Also in this case, there is no neat distinction: the best efficiency is achieved by PIM accelerators, which are competitive from 2018, when RobustInMem [268] was proposed. It scored an efficiency 11.77% lower than the most efficient accelerator at the time, the Systolic PuDianNao [255]. In 2019, the PIM Multibit [239] and the Systolic Tesla FSD Computer NPU [277] were proposed, with the former still being the most efficient ML accelerator, with 5.533 TBps/W. In 2020, Untether TsunAImi [282] came closer, with 5.00 TBps/W. Overall, four accelerator types are close together, with efficiencies in the range 4–6 TBps/W: PIM, Spatial, Systolic, Manycore.

5. Open challenges

In this Section, we discuss some open challenges that affect accelerated architectures horizontally. For each of them, we describe the problem in order to let readers grasp its characteristics and why it matters. Then, we discuss some state-of-the-art techniques adopted to address it and also some prospective research directions.

5.1. Tackling the memory wall problem

Since its formulation in 1995, the memory wall problem [300] has been dominating the debate. It predicted the widening of the performance gap between memory system and processor, and the prediction proved correct over the years. Today's processors suffer the limitations imposed by the bandwidth and latency of the memory system on the rate at which they can consume data and instructions. The performance gap has become even more dramatic with the advent of highly parallel systems, due to the consequent narrowing of the channel to the memory reserved for each computing element [363]. Moreover, also from an energy point of view, energy consumption in today's systems is dominated by moving data back and forth rather than consuming them.

Traditional solutions concentrated on limiting the impact of the problem, addressing it by improving both the memory subsystem, e.g., speculative loads, prefetching techniques, and miss rate reduction; and the processor, e.g., investing in thread-level parallelism as a way to have "work to do", so as to avoid cache miss stalls. Unfortunately, not only multiprocessors are affected: every computing device that needs data or instructions from a distant memory resource may suffer the performance gap between its compute elements and the memory system providing operands. If a culprit is to be found, it must be identified in the distance between processing elements and memory that is dictated by the von Neumann model.

In the following, we analyze two main techniques adopted by accelerator designers to address the memory wall problem, both sharing the need to move away from the von Neumann model.

5.1.1. State-of-the-art solution: Memory-rich processors
The first approach is the so-called memory-rich processor, which consists in bringing as much memory on-chip as possible, closer to compute units. This same principle inspires the abundance of local memories in accelerated architectures and the presence of ever larger cache memories in general-purpose processors²: the integration of more

² At the time of writing, AMD is in the process of releasing EPYC 7003X "Milan-X" server CPUs with 768 MB L3 cache [364].

Fig. 12. Accelerators designed according to the memory-rich processor principle.

and more high-bandwidth, low-latency memory closer to compute units reduces the frequency of transfers from/to the main memory, improving efficiency, latency, and performance.

Several accelerators are designed according to this principle: for instance, the Manycores DaDianNao [180] and Graphcore IPU-M2000 [201], and the Spatial accelerators Cerebras WSE [172], GroqCard [208], and Untether TsunAImi [282]. Fig. 12 shows block diagrams of DaDianNao, Cerebras WSE, and Untether TsunAImi taken from original papers, whitepapers, and presentations. Manycore and Spatial accelerators have in common the presence on-chip of many PEs. In the Spatial case, these are interconnected and can exchange data directly. The memory-rich principle applied to these accelerators involves the addition of abundant on-chip memory that may or may not be partitioned between PEs.

With memory-rich accelerators, the need to exchange data with the main memory is not eliminated. On-chip memory needs to be fed with input data to process, instructions need to reach the PEs, and output results need to be written back to the main memory. There is still the need to overlap processing and data movement, with the latter potentially limiting the performance of the accelerator.

In these architectures, achieving an efficient use of on-chip memory is a challenge per se. If on-chip memory is partitioned between PEs, data movements must happen explicitly (e.g., with message passing between PEs), but they could happen transparently in a shared memory design. In both cases, data movements can lead to performance penalties, especially when the data accessed are physically distant from the accessing PE. More or less, these systems present the same challenges that characterize distributed systems: each PE should have its data as close as possible (inside its partition if the memory is partitioned), and data should move as little as possible. Depending on the programming model, achieving this may be the programmer's responsibility or the effect of an automatic procedure (either optimal or heuristic-based) as part of a building process. The latter is a common solution for industrial accelerators: manufacturers usually provide it as part of ad-hoc backends to popular frameworks/libraries (e.g., TensorFlow [56], PyTorch [305]). In any case, both programmer-driven and compiler-driven solutions can work if the accelerated application exposes a regular access pattern to the memory, so that either the programmer or the compiler can arrange computations in a way that minimizes data transits. Conversely, applications with an intrinsically irregular data access pattern, data-dependent in the worst case – thus, not knowable in advance – represent a big challenge for these architectures as much as for traditional ones.

5.1.2. State-of-the-art solution: PIM
PIM is a complementary approach with respect to memory-rich processors. It is based on bringing computation closer to the memory to reduce the energy needed to move data and to increase bandwidth. It features two variations: near-memory computing is obtained by placing compute elements as close as possible to memory banks, be they main memory, caches, or even scratchpad memories; in-memory computing is obtained by fusing memory and computing capabilities in the same component, either by augmenting traditional technologies with computing units or by adopting emerging technologies that naturally provide both.

The PIM idea is not new, as early proposals date back to 1970 [365]. It was discussed again in the 1990s, but "at the time, the industry moved towards increasing the off-chip memory bandwidth instead of adopting the PIM concept due to costly integration of computation units inside memory dies" [279]. That cost was drastically reduced in the 2010s thanks to the emergence of 3D stacking technologies, which allow memory and computing dies to be stacked together and exchange data at high speed thanks to through-silicon vias [23]. Apart from this, two other factors contribute to the PIM blooming we witness these days: the emergence of novel technologies that blur the line between memory and computation, like memristors, and the experimentalism hardware designers had to embrace as a byproduct of and a reaction to the end of frequency scaling and, in general, the slowdown in general-purpose processors' improvement.

The goal of near-data and in-DRAM accelerators is to drastically reduce data transfers between CPU and main memory, which are the most expensive ones from both latency and energy points of view. They cannot be eliminated altogether, since the CPU likely continues to be employed to program PEs and offload work to them — as happens in the Heterogeneous-PIM [212] case, which is programmed in OpenCL and receives work from a host-operated profiling-based scheduler. Therefore, if not data, at least instructions need to reach the PEs [366]. However, both DRAM and 3D-stacked technologies (i.e., HMC, HBM, HBM2) have a modular design that affects the architecture of accelerators based on them: single PEs (that can be as small as a few logic ports) are local to single modules – banks in DRAM and vaults in 3D-stacked memories. For this reason, PIM accelerators are also similar to distributed systems: each PE's processing should be limited as much as possible to data contained in the same module, as communication across different modules can be expensive and have a big impact on performance and efficiency. The same considerations about programming models for memory-rich accelerators apply also in this case, and, unfortunately, so does the observation that applications with irregular or data-dependent access patterns are not a good fit for these accelerators.

There are some specific challenges related to the technologies used to achieve PIM. Memristors, which are the heart of ReRAM-based in-memory accelerators, suffer from various issues: low precision [21,206], ADC/DAC overhead [21], and durability issues related to write operations [265].

Traditional technologies present their challenges as well. In-SRAM computing is performed by activating two or more SRAM word-lines

and sensing on the pair of bit-lines [23]. Logical operations between word-lines can be implemented easily, but more complex ones like arithmetic operations are bit-serial instead of bit-parallel, which increases the number of cycles needed to perform an operation. An alternative is to rely on mixed-signal computing (e.g., PROMISE [254], Mixed-signal [238]), which mixes digital and analog signals, but it works well only for reduced precision, needs costly ADCs, and requires modifications that decrease the memory density [23]. In-DRAM computing is also achieved by activating more word-lines and reading the outcome in the bit-line charges. This technique usually requires circuit modification, supports limited operations (AND and OR), is sensitive to charge variance, and is destructive of the word contents [23]. Moreover, in-DRAM computing has proved historically difficult because the integrated computing logic must be implemented with the DRAM process, which produces slower processing elements with respect to CPUs. This, unfortunately, is a problem common to all near-memory and in-memory proposals, as no process is as optimized for speed as the CPU process. A reduced speed in computing logic may risk nullifying the performance gains of an even drastically reduced data movement, and this is even more true when complex operations that require many cycles are involved.

5.1.3. Perspective: PIM consolidation and new technologies
Memory-rich and PIM accelerators, both near-memory and in-memory, are already well represented in the accelerator world. We argue that their presence in both market and academia will steadily increase in the next years, as technological scaling will keep on exacerbating the memory wall problem, and this processing paradigm appears to be one of the main broad directions for stepping away from von Neumann architectures. However, these solutions are currently in their infancy, and need further research to reach maturity from both technological and software perspectives.

The biggest technological challenge concerns the difficulty for PIM accelerators to exploit fast processing elements, which is determined by the different processes adopted for memory technologies and processors. Since the former is optimized for memory density and not for speed, it is a poor choice for computing logic. As PIM proposals become more and more widespread, we can expect novel techniques to be explored solely for this purpose, possibly looking for a trade-off between the two that may pursue higher computing speed at the expense of memory density, positioning "in the middle" between the two. In the case of 3D-stacked memories like HMC or HBM, it would be sufficient that this speed-oriented process is limited to the logic die.

Much research is addressing the shortcomings of specific technologies. Memristors' low-precision issues can be mitigated by adopting the splicing method, which consists in employing multiple cells for different bits, later combined into high-precision numbers through shift–add operations [192]. The durability issue is due to the high voltage necessary for writes [171], and can be mitigated by lowering the voltage. In turn, this would make writes less deterministic [171], so this solution is viable only for applications that can tolerate the resulting errors. In-DRAM-related issues are also being addressed at various levels [23,186,367,368]. Overall, there is much research that aims at limiting the shortcomings of particular technologies, and we can expect it to continue and find increasingly effective solutions.

From a programming point of view, there is a need for sophisticated programming tools targeting both memory-rich and PIM accelerators. These should address the similarities between these architectures and distributed systems, leveraging many techniques that matured in those contexts, but at a drastically different scale: minimizing data exchange between processing nodes, promoting data reuse at node level, allocating frequently-communicating logical nodes on neighboring physical nodes, pursuing load balancing, etc. An important difference to consider is that, with respect to nodes commonly found in distributed systems, these nodes could have severe limitations (e.g., they could be very simple PEs featuring only a few logic ports) and could be unable to run even the simple peripheral daemons/processes that may contribute to system management. Depending on the characteristics of the accelerator, such management could necessarily be centralized. However, even considering these differences, we think that future research should investigate strategies to map the best practices implemented in tools such as Slurm [369] or Kubernetes [370] onto these accelerators.

Other promising research perspectives come from technologies that have the potential to eliminate the performance and energy gap between the memory system and the computing elements. In the shorter term, some breakthrough technologies are on their roadmap for integration into mainstream products, and silicon photonics is one of them. It promises to reduce by around one order of magnitude the physical communication latency on chip, thus making units, modules, and zones of chips closer in time. This can, on one side, reduce the magnitude of today's wire-delay effect, re-enabling some design choices and, on the other side, might also foster new solutions otherwise unfeasible [371].

For instance, one of the key cornerstones of HP's The Machine [372] proposal was a notable usage of photonics to bridge distances and glue up the machine at various levels. Silicon-photonics interposers have been present in the technological roadmaps for more than 10 years and are getting closer and closer to adoption. We can expect silicon photonics to potentially be a key ingredient to enable novel architecture designs based on the PIM paradigm, HBM-/HMC-inspired technologies, etc. However, photonics is intrinsically an end-to-end communication, different from the consolidated store-and-forward approach of the electronic NoC era, because it is inconvenient/impossible to perform computations (e.g., routing) in the optical domain. Therefore, specific, novel design solutions will need to be investigated for blending in the intrinsic opportunities of the technology [373].

From a slightly different perspective on photonics, there are attempts at computing directly in the optical domain [374]. For instance, the approach maps to light parameters (e.g., polarization, intensity, color, etc.) the state variables of data-parallel elements needing, particularly but not necessarily, linear operations to be performed. Computation happens through light filtering and the results are then converted into the electronic domain. Huge efficiency and latency benefits have been reported for computations that can fit this scheme, which are within very strategic domains like convolution, etc. For these reasons, accelerators based on photonics could well be envisioned in the future and have the potential of positively addressing the memory wall and computational density problems for some classes of computations.

5.2. Reconfigurability beyond FPGAs

In the last decades, FPGAs have established themselves as the main reconfigurable device. They are characterized by up to millions of logic blocks [14], memory cells, and specialized blocks, all interconnected by a statically reconfigurable interconnect. They offer a so-called fine-grain, even bit-level [214,375], reconfigurability that makes them suitable for general-purpose computing. Although this high flexibility is the main reason for FPGAs' success over inflexible ASICs, it comes at a cost: 60% of FPGAs' area and power are spent in the programmable interconnect, and the long combinational paths limit their operating frequency [250]. Moreover, bit-level flexibility is often not necessary, as relying on functional units that implement logical and arithmetic operations according to one of the existing technical standards (e.g., IEEE 754 [376]) is usually good enough for the needs of an application. Taking this into account, researchers have studied alternative ways to leverage reconfigurability that are not affected by the aforementioned limitations.

5.2.1. State-of-the-art solution: CGRAs
One example is the CGRA design (e.g., HReA [213], HRL [214], NDA [241], NP-CGRA [246], Plasticine [250], PX-CGRA [257], REMUS [267], Softbrain [10], X-CGRA [287]), which embraces a lighter form of reconfigurability, so-called coarse-grain, and was first proposed in the 1990s [18]. Like PIM, CGRAs gained attention only recently, because of the slowdown in general-purpose processors' improvement that pushes the need of exploring novel architectural solutions. They are an example of spatial architecture, designed as arrays of reconfigurable PEs, equipped with word-level functional units and usually multi-banked SRAM memory. PEs are connected through a reconfigurable interconnect less complex than that of FPGAs. With respect to these, they are less flexible (word-level vs bit-level) in both computation capabilities and interconnect, but this gives performance and efficiency gains, alongside an additional property that acts as a fundamental enabler for many use cases: the more limited spectrum of reconfigurability allows ns–μs reconfiguration times [18]. Unlike FPGAs, which have a reconfiguration time in the order of ms–s, CGRAs can be effectively used for temporal computations. Different configurations can alternate rapidly, leading to an execution flow that closely resembles that achieved by instruction-driven programmable architectures. As a counterpoint, the reduced flexibility means that each CGRA is only suitable in its own domain, unlike both general-purpose processors and FPGAs. Taking everything into account, we think that FPGAs will keep their position as the preferred solution to implement prototypes, but the interesting characteristics of CGRAs can lead to an expansion in the market of these architectures.

At the moment, the various proposals coming from Academia discussed in this survey show that CGRAs are an active research area. However, they are almost non-existent in Industry, although some applications exist, as briefly listed in [18]. The demonstrated value in terms of performance and efficiency will ultimately lead them to find their way into Industry, but there are still some open research questions and challenges that need to be addressed before this can happen. In [18], Liu et al. identify in programmability, productivity, and adaptability the main areas where CGRAs fall short. They present four challenges that CGRA designers need to address and give their take on possible research directions: define a productive programming model for CGRAs that is able to produce efficient code with minimal manual intervention; improve the support for speculative parallelism, which is an important source of performance; improve CGRA virtualization by taking inspiration from techniques that proved successful for FPGAs; improve the efficiency of the memory system by investigating vectorized/streaming memory accesses, dynamic customization of the access pattern, and PIM techniques. The first challenge, in particular, is the most important for a wide adoption of CGRAs in the market, as argued also in [377]: "the challenge with CGRAs is the development of a tool flow to schedule word-level coarse-grained computations into a mesh of PEs with a computational efficiency and functional density that surpasses FPGAs. Such tools should provide a way for the user to program the machine efficiently using popular high level programming languages like C/C++". In Section 5.6 we argue that this is the most important aspect for accelerator adoption in general.

5.2.2. State-of-the-art solution: Limited reconfigurability
Apart from investing in CGRAs, another possible solution is to exploit reconfigurability by limiting it to a single aspect, like routing-only, as in FPSA [192] and Micron Automata Processor [231], or computation-only, as in DRISA [186]. Accelerators could take advantage of partial reconfigurability for critical aspects only, so as to achieve higher performance and efficiency figures with respect to equivalent fully instruction-driven solutions. FPSA [192], Micron Automata Processor [231], and DRISA [186] have no instruction-driven components at all. In these accelerators, reconfigurability is responsible for providing some degrees of freedom to their users, but they do not consistently innovate the classic approach to reconfigurability: they limit its scope with respect to FPGAs, placing themselves halfway between an ASIC and an FPGA.

5.2.3. Perspective: HRPA
CGRAs have the potential to gain a wide adoption in the market, but there are various challenges connected to their effective exploitation that are yet to be solved. In our opinion, among them, the difficulty of effectively and productively programming them is the biggest obstacle to their adoption. However, even if CGRAs as they are will not gain wide adoption, there is one lesson that can be learned from them: reconfigurable logic with a fast reconfiguration time is a valuable architectural component, as it could be used to implement critical logic efficiently, while not hampering the support for temporal computations.

Fast reconfigurable logic could be employed in a novel architecture that would have various advantages: a Hybrid Reconfigurable/Programmable Accelerator (HRPA). Unlike accelerators like Xilinx Versal ACAP [152], where programmable (Scalar and Intelligent Engines) and reconfigurable (Adaptable Engines) resources are separated in a tiled architecture, an HRPA would take advantage of a tighter integration of the two, e.g., with a spatial array of programmable engines interconnected by a reconfigurable interconnect.

In our vision of such an architecture, every PE would represent a processing node, and as such would be responsible for performing a task. Each task could be expressed as a function that processes input data and produces output data. The instruction-driven PEs would allow functions that execute an arbitrary sequence of instructions, conveniently expressed with general-purpose programming languages. Connections between PEs could be expressed with point-to-point abstractions like Golang's channels [378], and be mapped on the reconfigurable interconnect as routing paths. If the reconfiguration time were on a much different time scale with respect to the instructions, a productive use of such an accelerator would be difficult to achieve. Conversely, if the reconfiguration time is in the ns–μs range, as in CGRAs, such a hybrid reconfigurable/programmable architecture could be controlled in a homogeneous way, with a special instruction/command packet destined to set the configuration in the same way as other instruction/command packets would control programmable resources. Such an accelerator would be a good mapping for dataflow workloads and graphs with non-trivial processing steps at the node level.

An HRPA would present some challenges of its own that are typical of spatial accelerators, like achieving efficient clocking/synchronization strategies or an efficient exploitation of the memory subsystem. However, it would have the potential of being easily adoptable from a programming point of view and of having an efficient way to manage data routing between processing components.
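Before moving on, a minimal sketch can make the envisioned HRPA programming style more tangible: tasks are ordinary functions destined to instruction-driven PEs, and the point-to-point connections between them are expressed as Go-style channels, which a hypothetical toolchain would lower to routing paths on the reconfigurable interconnect. Everything below is purely illustrative — no such toolchain exists.

```go
package main

import "fmt"

// stage models a task assigned to one programmable PE: it consumes
// values from `in`, applies an arbitrary function, and forwards the
// results on `out`. On a real HRPA, the channel would be lowered to a
// routing path configured on the reconfigurable interconnect.
func stage(in <-chan float64, out chan<- float64, f func(float64) float64) {
	for v := range in {
		out <- f(v)
	}
	close(out)
}

func main() {
	src := make(chan float64)
	mid := make(chan float64)
	dst := make(chan float64)

	// Two chained tasks: scale, then offset. Each would map to a PE.
	go stage(src, mid, func(v float64) float64 { return 2 * v })
	go stage(mid, dst, func(v float64) float64 { return v + 1 })

	// Feed the dataflow graph and collect the results.
	go func() {
		for i := 0; i < 4; i++ {
			src <- float64(i)
		}
		close(src)
	}()
	for r := range dst {
		fmt.Println(r) // 1, 3, 5, 7
	}
}
```

The value of the abstraction is that the same source describes both the temporal part (the per-task code) and the spatial part (the channel topology), which is exactly the coexistence an HRPA would need to exploit.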
5.3. Designing hierarchical accelerators

Beside the trend of integrating more and more relatively simple accelerators on heterogeneous SoCs, a complementary one is emerging. Some accelerators are evolving towards embracing the acceleration concept within themselves, thus forming de-facto a fractal-like architectural pattern at system level. In other words, accelerators have gained momentum due to the increasing limitations and technological constraints limiting the performance/energy scaling of general-purpose units. Then, some of these specialized accelerators have started to grow modules within themselves to make some specific sub-tasks more efficient or faster, forming a self-similar "hierarchical" architecture. This trend witnesses that the same limitations occurring in general-purpose units also affect some accelerators with a relatively wide application spectrum, to which they respond by embracing deeper internal specialization. Therefore, these accelerators not only target special-purpose tasks offloaded by the cores, but they also feature some specific hardware blocks, even more specialized, that push the boundaries of performance and energy efficiency. The challenges connected to these accelerators concern a fruitful cooperation between these different modules, as this heavily affects the performance and efficiency of applications that take advantage of more than one module.

5.3.1. State-of-the-art solution: Modern GPUs and FPGAs
Various examples can be found among complex, wide-spectrum accelerators, usually designed as PCI-e cards. For instance, high-end GPUs, which are mainly made up of tens of multi-processors featuring simple, in-order cores, optimized for graphics and data-parallel workloads. NVIDIA introduced with its Volta data-center architecture Tensor Cores [379], specialized in the MAC operations that characterize common ML workloads like CNNs. With Turing [135], it introduced Ray Tracing Cores, which are used to calculate reflections, refractions, shadows, and indirect lighting in real-time. AMD took a similar path with the RDNA2 architecture [78], with the introduction of Ray Accelerators, which are dedicated cores specialized in the calculation of ray intersections. In both cases, Tensor Cores and Ray Tracing Cores/Ray Accelerators perform operations that could also be performed by the standard in-order cores, but are more suitable for them from a performance perspective: the same relationship that exists between accelerators and general-purpose processors.

FPGAs are undergoing a similar process. Xilinx introduced them in 1984 with few programmable LUTs and switches [380], and they evolved in size and complexity with a variety of logic cells, LUTs, flip-flops, DSP slices, etc. [148]. The real hierarchical step has been taken recently, when Xilinx introduced ACAPs [152]. They are conceived as an accelerator with three main modules interconnected by a NoC: Scalar Engines, which are general-purpose ARM processors; Adaptable Engines, which are made up of programmable logic and memory cells (similar to classic FPGAs); and Intelligent Engines, which are VLIW PEs specialized in ML and DSP operations.

5.3.2. Perspective: Architectural and programming integration
We expect this trend to be more and more widespread in the design of future accelerators, and thus we expect this pattern to be an active research and development direction. In particular, the programming layers and the related ecosystem should address this hierarchical nature by investing in a seamless interaction between on-accelerator modules. For those cases that would normally require different programming approaches, the particular characteristics of these should be blended to achieve more natural programming models that stress interoperability, even across different domains. Currently, there is not much attention to this issue: for instance, CUDA Tensor Cores do not interoperate with CUDA cores and must be treated with special API functions. From an architecture point of view, the design of resources shared between different modules (e.g., on-board memory resources), the relative placement of modules on the accelerator to promote interaction or isolation, and even the wiring and connections (or lack thereof) between different modules are all aspects that need accurate planning, so as to optimize the accelerator layout, performance, and efficiency, taking into account mono-module and multi-module workloads. In general, both programming model and architecture should be designed with a systemic approach that promotes cooperation between modules, so users can take advantage of the peculiarities of each in complex workloads naturally, without artificially splitting their programs.

5.4. Unified virtual memory space

In the most common memory model adopted in heterogeneous systems, host processor and accelerator memory spaces are separated. The host processor manipulates virtual memory pointers and transparently leverages cache coherence protocols that move the burden of data consistency to the hardware and O.S. level. Accelerators may manipulate virtual memory or even physical memory, but usually have their own memory space that interoperates with host memory through explicit, DMA-mediated copies. This poses a heavy burden on the programmer: they must address this distinction by managing couples of pointers, one for each memory space, copying data back and forth between the two explicitly. Apart from programmability, performance may be impacted, as programmers could fail in achieving an efficient overlapping between data movement and computation phases, introducing unnecessary serializations.

In order to solve this criticality, many researchers and organizations like the Heterogeneous System Architecture Foundation (HSA) [381] are promoting support for a Unified Virtual Memory (UVM) space between host processor and accelerators. This would allow application-level pointers to be passed seamlessly between host code and accelerated code, without the hassle of explicitly managing replication and consistency.

UVM can be achieved whether the accelerator shares a level of the memory hierarchy with the host (e.g., integrated accelerators) or not (e.g., PCI-e cards). In both cases, data must be kept coherent so that modifications performed on the CPU are visible on the accelerator, and vice versa. In the latter case, explicit copies between two different physical memory spaces are still needed, but can be delegated to the host-accelerator interfacing logic. There are already interconnect-level protocols that implement it (e.g., CCIX [46], CXL [382]), or accelerator-specific high-level APIs that hide copies under the hood (e.g., CUDA [65] and OpenCL [57,58,309] Unified Memory). In both cases, the coherence mechanism must involve all the levels of the memory hierarchy: a datum written on the accelerator could be cached at all levels of the CPU cache hierarchy, and a subsequent read by the CPU of its copy in the L1 should return the same value written on the accelerator. The opposite also holds, as the CPU could modify a local datum and the updated value should also be read on the accelerator — which could have a complex memory hierarchy by itself. Data coherence between CPU and accelerator, due to the many memory levels involved and the variety of possible situations, is a demanding challenge, and it must be addressed to achieve a Unified Virtual Memory space.

5.4.1. State-of-the-art solution: Virtual memory support
In order to have a UVM space shared between CPU and accelerator, both of them need to support virtual memory. While this is a very common requirement for modern CPUs, the situation is not the same accelerator-side. The main cost is the hardware support for virtual-to-physical address translation [383,384], and thus specific techniques try to minimize this aspect.

For instance, in [384] (2015), Vogel et al. concentrate their attention on many-core accelerators in heterogeneous embedded SoCs (even if their approach can be extended to any accelerator with direct access to the main memory), and propose a lightweight virtual memory support for them. Starting from the consideration that a proper implementation of an IOMMU in that case could be unaffordable under area and energy consumption constraints, they propose a Remapping Address Block (RAB) as a software-managed replacement for the IOMMU. It is a simplified software-only approach that uses a kernel-level driver module and a user-space runtime to remap virtual addresses to physical addresses. The miss penalty is mitigated by prefetch, and prior knowledge of the memory access pattern is used to initialize the relevant data structure. Despite its simplicity, the technique delivers a 2× speedup with respect to classic copy-based solutions [384].

Also Hao et al. target heterogeneous SoCs [383] (2017). They propose an efficient address translation support for accelerators that share the physical memory with the CPU. They observe that accelerators that support virtual memory do it by means of IOMMUs with support for I/O Translation Lookaside Buffers (IOTLBs) and some logic to walk the page table. In these accelerators, the latency involved is very long and the performance reaches only 12.3% of the ideal address translation [383]. Their proposal is composed of three elements (see the sketch after this list):

• A small (16–32 entries), private TLB to save accesses to the IOMMU;
• A level 2 TLB shared between accelerators to filter translation requests on common pages;
• An interface to offload TLB misses to the host MMU.
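A behavioral sketch of that three-element translation path is given below — a small private TLB, a shared level-2 TLB, and a fallback that offloads the miss to the host MMU. The data structures, sizes, and the fake host walk are our own illustration, not the actual microarchitecture of [383].

```go
package main

import "fmt"

const pageShift = 12 // 4 KiB pages

// tlb is a tiny fully-associative translation cache: VPN -> PFN.
type tlb struct {
	entries map[uint64]uint64
	max     int
}

func newTLB(max int) *tlb { return &tlb{entries: map[uint64]uint64{}, max: max} }

func (t *tlb) lookup(vpn uint64) (uint64, bool) { pfn, ok := t.entries[vpn]; return pfn, ok }

func (t *tlb) insert(vpn, pfn uint64) {
	if len(t.entries) >= t.max { // naive eviction: drop an arbitrary entry
		for k := range t.entries {
			delete(t.entries, k)
			break
		}
	}
	t.entries[vpn] = pfn
}

// hostWalk stands in for offloading the miss to the host MMU/page table.
func hostWalk(vpn uint64) uint64 { return vpn + 0x1000 } // fake mapping

// translate follows the three-level path described in the list above.
func translate(private, shared *tlb, vaddr uint64) uint64 {
	vpn, off := vaddr>>pageShift, vaddr&((1<<pageShift)-1)
	if pfn, ok := private.lookup(vpn); ok { // 1. small private TLB
		return pfn<<pageShift | off
	}
	if pfn, ok := shared.lookup(vpn); ok { // 2. shared level-2 TLB
		private.insert(vpn, pfn)
		return pfn<<pageShift | off
	}
	pfn := hostWalk(vpn) // 3. offload the miss to the host MMU
	shared.insert(vpn, pfn)
	private.insert(vpn, pfn)
	return pfn<<pageShift | off
}

func main() {
	priv, l2 := newTLB(32), newTLB(512)
	fmt.Printf("%#x\n", translate(priv, l2, 0x7f001234)) // miss, walks on host
	fmt.Printf("%#x\n", translate(priv, l2, 0x7f001abc)) // hit in the private TLB
}
```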

This scheme achieves 93.6% of the performance of an ideal address cache&flush approaches. Because of this, different allocation primitives
translation, but requires important additions to existing accelerators (e.g., CUDA [65]) or, equivalently, allocation primitives with memory
and SoC architecture. flag parameters (e.g., Vulkan [63], OpenCL [57,58,309]) may be ex-
In general, using TLBs in accelerators poses challenges, because posed to the programmer, so they can control the intended outcome.
it requires either (a) processor-specific page-walking strategies to be The different policies are implemented by runtime and drivers, with
implemented accelerator-side, which is very complex and not portable; several implications on performance [387]. We can expect this trend
or (b) relying on the CPU translation only in case of accelerator TLB to be mimicked also in other accelerator domains, as it is flexible
misses, which is faster on average but still requires complexity on the enough to support various use-cases, requires no modifications at the
accelerator; or (c) have all translations done in the CPU, which reduces hardware level, but has the drawback of increasing the programmer’s
accelerator’s complexity at the expense of performance penalties. It can responsibilities. Simultaneously, we are already witnessing proposals
be expected that all the intermediate shades between macro-approaches based on automatic and semi-automatic mechanisms that keep track
from (a) to (c) will be adopted by different accelerators. For instance, or even infer the intended memory usage to adopt selective caching
considering the typical workloads, intense usage patterns, probably policies [388].
finer-grain, between accelerator and unified memory could push to- Second, the potential impact that the presence of a cache coherent
wards the (a) extreme, while more infrequent and bulky transfers are accelerator may have on the system. Taking again GPUs into considera-
more likely to promote the implementation of solutions close to (c), as tion, they tend to consume data at a high bandwidth. Adding naïvely a
the high translation overhead would be amortized. GPU to a MESI/MOESI protocol as if it were another core may seriously
A different approach can be found in [385] (2018). Haria et al. hamper CPU’s functioning, as the GPU would fill caches with its data
propose a scheme to keep virtual memory as-is but to eliminate the and wipe out whatever was there before, like cores’ and other accel-
need for address translation in the majority of cases — in their words, erators’ data. This is a well-known research problem going back to the
to “devirtualize virtual memory”. Their approach is organized in the first attempt to put cores and a GPU on the same processor die: already
following steps: in 2011, integrated GPUs in AMD Fusion APUs distinguished between
uncacheable and cacheable memories [389], giving to programmers
• Memory is allocated such that physical memory and virtual mem-
ory are almost always identical (identity mapping);
• Memory protection is enforced by checking application permissions for each access.

Their De-Virtualized Memory (DVM) reduces the translation overhead to less than 2%. It also brings advantages in programmability, power/performance, flexibility, and safety [385]. It must be noted, however, that it requires small modifications to the O.S. to support identity mapping, and imposes further constraints by construction, like eager paging, which is more expensive in terms of required disk space.

5.4.2. State-of-the-art solution: Cache coherence techniques

Many works analyze the implications of cache-coherent accelerators. For instance, in [303], Giri et al. discuss the SoC case, which is usually characterized by the sharing of physical memory between the CPU and on-chip accelerators. In particular, they study the impact of coherence on performance in the case of accelerators designed in isolation and then integrated into the SoC. They identify in the literature three cache-coherence models:

1. non-coherent;
2. coherent with the last-level cache (LLC);
3. fully-coherent.

Non-coherent accelerators access the DRAM directly through DMA, bypassing the cache hierarchy. In order to get coherent data, the processor caches must be flushed upon accelerator access; support for DMA bursts is necessary to make this model efficient. LLC-coherent accelerators use DMA to access the LLC. In this case, the requested data region must be flushed only from the private caches of the processor. This model is efficient only if the LLC has a high hit rate. Fully-coherent accelerators are included with the processor's caches in a classic coherence protocol, such as MESI or MOESI. In this case, no flushing is necessary. They also identify a fourth model, which is implemented by ARM's AXI Coherence Extensions (ACE). It keeps uncached accelerators coherent: any shared access to memory can "snoop" into the caches [386].
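To make the programmer-visible consequences of these models concrete, the following minimal sketch contrasts an explicit-staging offload, in which the accelerator never shares a coherent view of memory with the CPU, with a single allocation that both sides access through a shared view. The sketch is ours and is not taken from [303] or [386]; CUDA is used only as a familiar stand-in for a generic accelerator API, and the function names (scale, offload_non_coherent, offload_coherent_view) are illustrative.

    #include <cuda_runtime.h>

    __global__ void scale(float* data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    // Non-coherent style: the device works on a private copy; the explicit
    // copies play the role of the flush/DMA step of the first model.
    void offload_non_coherent(float* host, int n) {
        float* dev = nullptr;
        cudaMalloc((void**)&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
    }

    // Coherent-view style: one allocation is visible to both CPU and GPU,
    // and no explicit copies or flushes appear in the application code.
    void offload_coherent_view(int n) {
        float* buf = nullptr;
        cudaMallocManaged((void**)&buf, n * sizeof(float));
        for (int i = 0; i < n; ++i) buf[i] = float(i);   // CPU writes
        scale<<<(n + 255) / 256, 256>>>(buf, 2.0f, n);   // GPU updates in place
        cudaDeviceSynchronize();                         // CPU can now read buf
        cudaFree(buf);
    }

Which of the two styles is preferable depends, as discussed next, on how often the CPU and the accelerator actually touch the same data.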
In order to opt for one policy or another, a variety of factors must be taken into account. First, cacheability and coherence requirements may depend on application needs even within the same accelerator. Considering GPUs, for instance, GPGPU buffers and control structures may need to be coherent because they are read/written by both GPU and CPU very often. Conversely, tasks like display/render need no coherence and benefit from the opportunity to resort to uncacheable memory whenever possible. However, to properly handle those cases where caching is applied and coherence is involved, various solutions have been proposed. Some examples are cache partitioning [390–392], cache locking [392], and other modifications to the cache functioning [393]. Also in this case, we can expect these techniques to be adopted for other high-bandwidth accelerators that may suffer from the same foundational issue. We expect this to be one of the top research and development priorities in the evolution of accelerated architectures.

5.4.3. Perspective: Different solutions for different use cases

As we observed, virtual-to-physical translation is the most critical design choice to implement virtual memory support on accelerators. The various possibilities (accelerator-side translation, accelerator-side translation with CPU intervention on TLB miss, CPU-side translation) have pros and cons that make the best choice dependent not only on the accelerator's characteristics, but also on the application. In the case of simple, fixed-function accelerators, the best choice can be uniquely determined, but non-trivial accelerators exposing programmable or configurable operations do not permit this. For instance, the need for efficient management of scalar and small-sized structures for configuring and controlling the accelerator activities while in operation, monitoring, and signaling could require efficient low-latency autonomous translation (e.g., in contexts where an accelerator is part of a pipeline of streamed data elaboration between itself and one or multiple cores). In such applications, large data buffers also need to be moved in and out of the accelerator for processing, so we can imagine that smart DMA engines will be refined to manage both situations efficiently: bulk buffer copying, along with finer-grain, fast transfer support. Therefore, the best design choice would be to investigate the coexistence of various strategies, together with automatic techniques to determine which of them would be the most convenient for a given translation task.

In general, whenever TLBs are replicated, coherence needs for their content emerge and pose related challenges. Either accelerators' TLBs expose a HW/SW interface similar to that of general-purpose cores and are visible to the O.S. as nodes of the virtual memory management, or coherence maintenance will require ad-hoc solutions at a certain level in the drivers and/or runtime. This situation pushes on one side towards a standardization of the Virtual Memory interface of accelerators, and on the other raises research opportunities on novel OS-cores-accelerator interaction patterns.

From a different angle, we can expect that DVM-inspired approaches could be blended with TLB-based ones through hybrid solutions trying to limit each other's weaker facets, maybe also assisted by ad-hoc functionalities implemented in the cache hierarchy.


Our analysis highlights the necessity of supporting various scenarios for coherence strategies as well. Simple accelerators with straightforward needs could allow designers to opt for a neat strategy (e.g., a non-coherent accelerator), but complex ones with various use cases suggest that different strategies should be allowed to accommodate different application needs. The availability of cacheable and uncacheable buffers, which is common in GPU-oriented APIs (both graphics and data-parallel), could extend to other accelerator APIs and even to high-level programming languages to ease interoperability with accelerator-oriented libraries. So, for this aspect in particular, we think that a general solution is difficult to achieve, and giving programmers the control could be a viable choice to achieve high performance or efficiency in various scenarios, avoiding unnecessary cache traffic when possible, and opting for coherent buffers when necessary. In this regard, a consolidated, standardized interface for managing these aspects is not available yet, but the increasing needs are significantly pushing research and development advances in this strategic direction.
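As a concrete illustration of this kind of per-buffer control, the following sketch uses the flags that CUDA already exposes for pinned host memory. It is only an example of the interface style that, as argued above, could be generalized to other accelerator APIs and languages; the buffer names and sizes are ours.

    #include <cuda_runtime.h>

    void allocate_example_buffers() {
        float* coherent_ctrl = nullptr;
        float* wc_stream = nullptr;

        // Cached, mapped host buffer: suited to small control/monitoring data
        // that CPU and accelerator read and write frequently.
        cudaHostAlloc((void**)&coherent_ctrl, 4 * sizeof(float),
                      cudaHostAllocMapped);

        // Write-combined buffer, uncached on the CPU side: suited to data the
        // CPU only writes sequentially and the GPU only reads, so that no CPU
        // cache traffic is generated for it.
        cudaHostAlloc((void**)&wc_stream, 1 << 20,
                      cudaHostAllocMapped | cudaHostAllocWriteCombined);

        // Kernels would access these buffers through device pointers obtained
        // with cudaHostGetDevicePointer().

        cudaFreeHost(coherent_ctrl);
        cudaFreeHost(wc_stream);
    }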
5.5. Virtualization

In data-centers, different users run their applications in virtual environments that share the same physical resources and provide encapsulation, protection, and security. According to the current trend, both the number of users and the number of applications will likely increase in the next years, as we are already witnessing the migration of many services into the cloud. Accelerators are already being deployed in data-centers [394,395] and will likely play a prominent role in the near future, as their interesting characteristics in terms of promised throughput and efficiency are particularly appealing in the energy-constrained context of data-centers. They can be made available to users by ad-hoc services that manage request queuing and scheduling, but the most straightforward way is to grant access from virtual environments, giving users the sensation of exclusive ownership that they would have on their personal workstation. To achieve this, accelerators need to be virtualized.

Virtualization is also important to support remote accelerators: a virtualized remote accelerator would be used seamlessly in a user application as if it were physically present on the current machine, while the backend would take care of forwarding commands to the nodes that physically host the accelerator. This would allow a more natural way to share limited resources and promote interesting computing scenarios (e.g., fog computing [396]).
5.5.1. State-of-the-art solution: Per-accelerator approach

Some accelerators already support virtualization and are successfully employed in data-centers. A few examples are AMD's GPUs, which mount a hypervisor agent [74], NVIDIA GPUs, with their NVIDIA vGPU technology [138], and Google TPUs, which recently announced Cloud TPU VMs [397] to control TPUs from virtual machines. Other examples include FPGAs [398–400]; Xeon Phi [401]; GPUs in automotive environments [402]; and GPUs on servers with API remoting [403,404], hardware-assisted [405,406], and all-software [407] techniques. These solutions are very specific to selected scenarios, and cannot adapt to the general case. Although individual accelerators present individual challenges related to their architecture, whether they are reconfigurable or not, and their programming environment, a general solution should be investigated. It would benefit accelerator designers and users alike, freeing the former from designing ad-hoc solutions and giving the latter the ability to use any accelerator in virtual environments. The first question to answer in search of a general solution is what level of the hardware–software stack it should target.

5.5.2. State-of-the-art solution: API remoting

Yu et al. summarize well the difficulty of finding a general solution from a full-software perspective [408]. They identify in the opaque proprietary programming layers (user-mode API, user-mode drivers, kernel-mode drivers) the reason why accelerators cannot be virtualized by interposing virtualization code in the middle of the stack. Their approach is to rely on API remoting: maintaining an API server on the host or in a dedicated VM that intercepts user-level API calls. They propose Automatic Virtualization of Accelerators (AvA), which provides a declarative API specification language. Starting with an API specification in the form of an annotated header, AvA virtualizes it and generates a paravirtual (i.e., an adaptation of the software interface so it can be used in a virtual environment) communication infrastructure [408].

AvA targets the highest level of the software stack, leaving the lower layers unaltered. The major criticism that can be raised against this approach, and that is common to the solutions based on API remoting, is that it has a very difficult goal to achieve: a one-to-one correspondence between the original and the remoted API, including syntax, semantics, and side effects. This requires a deep understanding of the user API and the ability to interpret it correctly, which is severely impacted by the possible presence of undocumented constraints, the size of the API, and its stability over time.
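A minimal, entirely hypothetical sketch may help to visualize the API-remoting mechanism: a guest-side shim re-exports one call of an imaginary accelerator API and forwards its marshalled arguments to the host-side API server. None of the names below (accLaunch, RemotedCall, send_to_api_server) belong to AvA [408] or to any real accelerator API, and the transport is reduced to a stub.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct RemotedCall {
        uint32_t opcode;            // which API entry point is being invoked
        std::vector<uint8_t> args;  // flattened copy of its arguments
    };

    // Stand-in for the transport towards the host-side API server
    // (a virtio queue or a socket in a real system).
    static int send_to_api_server(const RemotedCall& call) {
        (void)call;
        return 0;  // pretend the server executed the call successfully
    }

    // Guest-visible entry point with the same signature as the "real" call.
    // The difficulty criticized above lives here: every argument, every piece
    // of implicit state, and every side effect must be reproduced faithfully
    // on the server side, for every call of the API.
    extern "C" int accLaunch(uint32_t kernel_id, const void* params, size_t size) {
        RemotedCall call;
        call.opcode = 0x01;  // hypothetical opcode assigned to accLaunch
        call.args.resize(sizeof(kernel_id) + size);
        std::memcpy(call.args.data(), &kernel_id, sizeof(kernel_id));
        std::memcpy(call.args.data() + sizeof(kernel_id), params, size);
        return send_to_api_server(call);
    }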
5.5.3. State-of-the-art solution: SoC hardware module

As an example of an opposite approach, Govindarajan et al. propose the use of a dedicated, programmable microcontroller on SoCs to proxy requests from multiple sources and schedule them on the corresponding devices [409]. A software entity (driver or runtime) would provide additional logic to identify the source of requests, generate the transaction credentials, and orchestrate scheduling on the accelerator. The credentials are used to tag transactions with client information, later used to perform translations in the system-level IOMMU on a per-client basis.

5.5.4. Perspective: A truly general solution

We acknowledge the difficulties analyzed by Yu et al. [408] in intervening in the lower programming layers, as these are inter-dependent and opaque, and we think that a more complete solution should target the hypervisor or even the underlying hardware level. As we highlighted in Table 8, the set of user-space APIs and frameworks is already big and fast-changing. Unless a proposed solution can address all of them transparently, approaches that require a translation, albeit semi-automatic, of each API are not feasible at the current state of things. The risk is to translate a subset of APIs, allow users to run a subset of applications in the virtual environment, and force them to port the remaining ones. As users are steadily migrating an increasing number of services to the cloud, an ideal solution would allow them to seamlessly port their applications and run them in the cloud.

In our opinion, the solution proposed by Govindarajan et al. in [409] goes in a promising direction. Its support for software-driven scheduling is fundamental to foster various scenarios. The simplest example is an automatic serialization of concurrent accesses to the accelerator, so as to grant exclusive access to one client at a time. It is the most appropriate technique for simple accelerators that have no means to manage multiple clients executing together, and also the most effective way to achieve isolation. Conversely, more sophisticated accelerators could take advantage of on-board scheduling or partitioning techniques that would allow multiple concurrent applications to execute even in a sand-boxed, secure way. In the first case, the software-driven scheduling would not be necessary, and could even act as a pass-through and rely on on-board scheduling techniques. In the second case, the interaction would prove more complex, as the software entity should be aware of the resources needed by the transaction and of the current resource availability on the accelerator. A tight integration with proprietary virtualization techniques would probably be necessary.


In order to be truly general, the solution proposed by Govindarajan et al. in [409] needs to be extended to target not only SoCs, but also PCI-e based accelerators. This can be achieved by moving part of the logic into the PCI-e controller or by expanding the PCI-e architecture à la CCIX [47]. As virtualization needs increase over time, we may expect in the future a convergence on a standard, vendor-independent solution that accelerator designers could adopt as a common interface to provide virtualization capabilities.

5.6. Programming accelerators

Today's world is heavily shaped by information and the many ways it can be manipulated and used. Computers, in their diverse incarnations ranging from wearable devices to data-centers, are the main instrument for this work. Governments, companies, universities, research centers, and even private citizens maintain billions of lines of computer code (in 2019, the GHTorrent dataset contained 125 million projects from GitHub alone, the most popular code repository website [410]) that shape every aspect of modern life; they are the backbone of uncountable technologies and services we use every day, and have a primary function in driving scientific progress as a whole. Without the advancements of the past decades in high-level programming languages and compiler technology, these lines of code would be orders of magnitude more numerous and would be unmanageable.

High-level languages allow programmers to formulate elaborate operations and abstract concepts with few expressive lines of code. Not only do they offer a degree of abstraction and compactness that limits the program size, they are also closer to natural language than assembly or machine code, and allow expressing the entities representing both the problem-space and the solution-space more directly. As a consequence, writing, reading, maintaining, and debugging code is easier; in other words, programmers' productivity, program correctness and reliability, as well as system safety, are considerably higher. This is also due to the concomitant advancements in compiler technology. Modern compilers include tremendous optimization capabilities that can target performance, power consumption, or size of the produced code, even starting from very high-level computations and data structures. Thanks to them, most of the time today's programmers can concentrate on high-level coding only, relying almost completely on their compiler and avoiding tedious and error-prone hand-tuning of the compiled code altogether.

Fifty years of advancements in high-level languages and compilers are supported by a flourishing ecosystem that includes libraries, frameworks, debuggers, IDEs, deployment, and monitoring tools. However, this rich ecosystem is mostly about computing on general-purpose processors, and its evolution led to the point that the underlying architecture has been progressively abstracted away. General-purpose code is mostly portable, at least for what concerns the target architecture, which usually comes into play only at compilation time.

The emergence of multi-core processors suddenly broke the abstraction layers, putting architecture back in the foreground, even in high-level languages that strove to keep it hidden from the programmer. Parallel programmers need to think about their programs as aggregates of separable threads of execution, to be kept as independent as possible, minimizing synchronization points and avoiding undesirable outcomes such as data races, deadlocks, and load unbalancing. Some libraries and frameworks help them achieve this and take care of some of the complexity (e.g., OpenMP [60], C++17 Parallel STL [411], etc.), but they are mostly data-parallel frameworks. Abstracting away multi-threaded and multi-core parallel programming and delegating these activities to a parallelizing compiler is still an open research area [412–415], which is expected to receive great effort from both research and companies in the near future.

5.6.1. State-of-the-art solution: GPUs and the data-parallel ecosystem

The diffusion of GPUs complicated the picture even more. The first examples of GPGPU were an attempt to adapt GPUs' highly parallel architectures to perform workloads not related to graphics (workloads best described as data-parallel, albeit the GPGPU acronym refers to them as General Purpose). At first, they were coded by means of shaders, but the appearance of the NVIDIA CUDA programming language in 2006 [65] allowed researchers and practitioners to switch to a more natural tool. CUDA is an extension of the popular C++ programming language. It features a special syntax to launch kernel functions on the device and a few keywords to tag kernel functions, device functions, and data to be stored in special memory resources. It has its own compiler (nvcc), runtime, and a rich ecosystem that includes a debugger, a profiler, and libraries (linear algebra, NN, RNG, etc.). Although it succeeds in abstracting various details of the underlying architecture away, it still requires programmers to write complex code structured as a 1D/2D/3D grid of 1D/2D/3D blocks of indexed threads that execute in groups of 32 at once (warps), manipulate data in the device memory with the help of special low-latency memories (shared, constant, texture) optimized for different access patterns (coalesced access, broadcast, spatially local), and can synchronize both at block level and at warp level. Despite its complexity, CUDA contributed fundamentally to the diffusion of GPUs, arguably the most widespread accelerators today.
altogether. dation of existing approaches, as usual, while new and smaller projects
Fifty years of high-level languages and compilers advancements should embrace the vanguard of new approaches.
are supported by a flourishing of an ecosystem that includes libraries,
frameworks, debuggers, IDEs, deployment, and monitoring tools. How- 5.6.2. State-of-the-art solution: The machine learning ecosystem
ever, this rich ecosystem is mostly about computing on general-purpose While Data-parallel is a wide domain that encompasses a great
processors, and its evolution led to the point that the underlying archi- variety of applications, ML is not as wide, and the majority of ML
tecture has been progressively abstracted away. General-purpose code applications involve either running or training a model, usually a NN.
is mostly portable, at least for what concerns the target architecture that A successful framework in the Data-parallel domain needs to provide
usually comes into play only at compilation time. enough abstraction to achieve productivity, while still guaranteeing
The emergence of multi-core processors suddenly broke the ab- enough freedom and fine-grain control to express an uncountable vari-
straction layers, putting architecture back in the foreground, even ety of applications with few in common (e.g., embarrassingly parallel,
in high-level languages that strove to keep it hidden from the pro-
mono-/multi-dimensional stencil or scan, histogram, etc.). A successful
grammer. Parallel programmers need to think about their programs as
ML framework needs to help programmers to automatize the expression
aggregates of separable threads of execution, to be kept as independent
of the structure of the problem, so they can work on its tuning. For this
as possible, minimizing synchronization points, and avoiding undesir-
reason, the most popular frameworks and libraries in the ML domain
able outcomes such as data races, deadlocks, and load unbalancing.
have generally a higher level of abstraction than their data-parallel
Some libraries and frameworks help them to achieve this and take care
of some complexity (e.g., OpenMP [60], C++17 Parallel STL [411], equivalents (e.g., managing NN layers and relationships between them).
etc.), but they are mostly data-parallel frameworks. Abstracting away Most popular ML-oriented libraries and frameworks commonly used
multi-threaded and multi-core parallel programming and delegating today allow programmers to code applications that present several
these activities to a parallelizing compiler is still an open research opportunities for data-parallelism. GPUs were already widespread and
area [412–415], which is expected to receive great effort from both were the most common data-parallel platforms, together with multi-
research and companies in the near future. core CPUs, when these tools first hit the market: Caffe [64] was

4 5
In 2019, the GHTorrent dataset contained 125 million projects from These workloads are best described as data-parallel, albeit the GPGPU
GitHub only, which is the most popular code repository website [410]. acronym refers to them as General Purpose.

39
B. Peccerillo et al. Journal of Systems Architecture 129 (2022) 102561

launched in 2013, TensorFlow [56] and Keras [359] in 2015, PaddlePaddle [307] in 2016, PyTorch [305] and ONNX [306] in 2017. Therefore, all of them support at least multi-core CPUs and GPUs as native platforms of execution. Basically, they are conceived as high-level interfaces based on lower-level backends. This layered architecture has the advantage of integrating seamlessly with novel accelerators: it would be sufficient for accelerator manufacturers to provide an ad-hoc backend that manages all the accelerator-specific code under the hood. Programmers already proficient with their favorite tool can port their code to a novel accelerator and start developing for it without the need of learning new programming models, in principle. As shown in Table 8, this is the solution adopted in many cases, and we expect it to consolidate even more if the contour conditions remain stable. Moreover, this trend has the side effect of helping to consolidate well-established solutions further, pushing towards a standardization of the ecosystem.

5.6.3. Perspective: The future of accelerator programming

In both data-parallel and ML domains, we are witnessing a push towards a standardization of ecosystems, with few proposals dominating the landscape. Novel accelerators usually provide backends/compilers for these existing solutions, easing accelerator adoption and taking advantage of a consistent body of applications immediately on launch. We are also witnessing proposals that try to elevate the abstraction level or increase portability, but these are failing to replace well-established solutions. Overall, the strategy of supporting what is already widespread seems to be the most effective.

The situation is not as neat for other domains where a widespread solution does not exist yet, like Dataflow or Graph processing. In these cases, accelerator manufacturers can try to adapt a solution commonly used in other domains to their needs and take advantage of the popularity of that solution to attract developers, as done by the authors of IMP [226] and Heterogeneous-PIM [212] with TensorFlow and OpenCL, respectively, but also by NVIDIA with CUDA, which expanded a very popular general-purpose language (C++) with data-parallel capabilities. This solution may not be available for accelerators that operate in particular and/or novel domains. An original solution may be mandatory. In this case, given the trends in other domains, we argue that the only fundamental requirements are the ability to reach high performance/efficiency (i.e., effectively exploiting the underlying hardware) and a high level of abstraction. A novel accelerator could potentially be a fundamental enabler for a number of applications and services, but after decades of high-level programming languages and frameworks, developers are unlikely to want to go back to much lower-level techniques. This could delay the adoption of new approaches up to the moment when other constraints, e.g., emerging from the inefficiency of high-level solutions, appear clearly. From a slightly different perspective, the performance and/or efficiency gains provided by a new accelerator with respect to more traditional solutions may be counterbalanced by the productivity loss of the programmers, and thus prove not convenient overall. Conversely, a productive programming paradigm, even if completely different from well-established approaches, could be able to gain consensus in a relatively short time.

All the analysis done so far has a mono-domain perspective, since the majority of the applications we witness today have this characteristic. However, with the ongoing trend of integrating on SoCs various accelerators specialized for different tasks, the opportunity of leveraging a tighter cooperation between them should be extensively studied from a programming perspective too. Multi-domain frameworks, or at least interoperability between different ones, should be investigated, because they could stimulate novel, complex applications that we cannot easily envision otherwise. Some examples of the possibilities of this contamination can be taken from the recent achievements of ML techniques applied to other fields: resolution upscaling, autonomous driving, cancer detection, gaming [21]. This cooperation between accelerators needs to be investigated from a perspective that tends to unite them, rather than separate them through vertical approaches. The offload model needs to evolve significantly with the help of powerful hardware–software abstractions (e.g., Unified Virtual Memory, data coherence) and runtimes, to arrive at seeing the whole system as a single development platform with different processing components that can be used transparently, the same way as we see our CPU as a single processing element and ignore its nature as an aggregate of functional units specialized for different operations. Given the trends in various domains, in the future this might be achieved with high-level frameworks backed by integrated compilers and runtimes that are able to automatically identify, at various granularities, the high-level operation patterns and calculations that can be delegated to individual accelerators. In our view, the research in the accelerator programming field should point towards this goal, in order to give to programmers of heterogeneous systems the productivity level that they are experiencing in homogeneous contexts.

6. Related work

Among the many previous works that discuss accelerators, a paper by Caşcaval et al. [304] stands out for the similarity of intentions with ours. In that paper, the authors present and discuss a taxonomy of accelerator architectures. We took inspiration from it, but we present a richer taxonomy that includes more aspects and discuss the high variety of accelerators we witness nowadays. In our study, hardware and software aspects are presented in more detail, so we prefer a classification of families (e.g., NVIDIA GeForce RTX 30xx [138] or Intel Graphics Technology (Gen11) [105]), rather than broad categories (e.g., GPUs). Moreover, we provide a performance comparison of many accelerator models pertaining to two well-represented domains (Data-parallel and ML), and discuss some trends. Finally, we analyze open challenges that involve accelerators horizontally, and enrich the discussion with state-of-the-art solutions and prospective research directions.

To the best of our knowledge, the work by Caşcaval et al. is the only one that tackles accelerators in general and describes them from a holistic perspective. Other works keep part of this generality to some extent, but restrict their scope to one or more aspects (e.g., [18,21–23,292,294,297–299,394,416–441]). We discuss these works in the following.

6.1. Accelerators in particular domains

The most common approach when discussing accelerators is to restrict their scope to a single domain, i.e., the class of applications that can be executed on the accelerator. Among them, the most represented domain is by far AI/ML, due to the growing interest in this area and the consequent explosion of related accelerator proposals, as we show in this survey. For instance, in [416], Du et al. compare two classes of AI-oriented accelerators: those inspired by ML and those inspired by neuroscience (called neuromorphic), from performance, accuracy, and cost standpoints. In [21], Sze et al. discuss DNNs in-depth in a comprehensive tutorial that covers many aspects: history, applications, popular models, hardware platforms, and so on. In particular, in Sections V and VI they discuss DNN processing on accelerators. In [417], Li et al. discuss various neural-network accelerators and acceleration techniques, analyzing in-depth the DianNao family [181]. In [418], Reuther et al. discuss many accelerators and processors for machine learning, focusing on performance and efficiency figures. In [419], Mittal and Umesh survey works on spintronic accelerators for NNs and PIM. The same authors study RNN acceleration on GPUs, FPGAs, and ASICs [420]. They classify accelerators with respect to their core-computation engine (matrix–matrix multiplication, matrix–vector multiplication, etc.) and discuss optimization and simplification techniques. In [421], Deng et al. discuss in-depth compression in neural networks, and dedicate Section 4 to hardware accelerators.
In [422], Chen et al. analyze accelerators for DNNs. They identify four broad categories: on-chip accelerators, stand-alone accelerators, accelerators with emerging memories, and accelerators for emerging applications. For each category, they discuss some sample accelerators from both industry and academia. In [423], Moolchandani et al. classify and analyze works proposing acceleration of the CNN inference phase on ASICs. They group prior works in four groups, with respect to the metric they aim to reduce: computation time, number of computations, memory access time, and size of memory footprint. In [424], Mittal and Vibhu present an overview of accelerator architectures for 3D CNNs. They present prior works on the topic and discuss algorithmic optimizations.

Some works restrict their scope further, considering only FPGA-based AI accelerators. In [442], Shawahna et al. analyze the then current status of CNN acceleration, various optimization strategies adopted and classify the existing proposals. In [425], Shen et al. present a technique to design FPGA-based accelerators for CNNs, that they call multi-CLP (Convolutional Layer Processor). In [298], the authors present a survey of FPGA-based accelerators for Deep Learning, mainly CNNs. They discuss existing solutions and identify promising development strategies. A similar work is published by Mittal [297], which presents a survey of Convolutional and Binary Neural Network-oriented accelerators on FPGAs.

Data-parallel is another well represented domain in the accelerator world, since it comprehends all the contemporary GPUs and some notable, albeit abandoned, examples like the Intel Xeon Phi family. However, as far as we know, the only work that studies them in general is a paper by Lee et al. [426]. Here, the authors study data-parallel accelerators' micro-architectures comparing SIMD, MIMD, and vector technologies. They study the implications of each model to programmability, area, and efficiency, proposing their own vector-thread solution.

Other notable works include a paper by Gui et al. [427] and a book by Kurzak et al. [428]. In the former, the authors present many accelerators for graph processing based on different paradigms, and many aspects associated to graph acceleration like runtime support, scheduling schemes, etc. In the latter, the authors discuss the acceleration of scientific computing workloads, performed mainly on multicore processors, the Cell Broadband Engine [290], and GPUs [428].

6.2. Energy-constrained scenarios

As we said, the main reason of interest in accelerators is their promise of improving non-functional requirements such as performance and efficiency with respect to general purpose solutions. Efficiency, in particular, is of the utmost importance in energy-constrained contexts, such as embedded and data-centers.

In embedded, emerging applications like edge computing and IoT would particularly benefit from tiny accelerators that would perform their tasks for a fraction of the energy needed by general-purpose processors and micro-controllers. In Iniewski's book about embedded systems [434], various chapters are dedicated to hardware acceleration. Chapter 2 is dedicated to hardware for computational biology, Chapter 3 to embedded GPUs, Chapters 5, 6, and 7 to FPGAs, and Chapter 14 to reconfigurable architectures for cryptography. In [435], Moyer and Watanabe discuss accelerators in embedded systems from an architectural perspective. They focus on accelerator characteristics (blocking or not, shared or not, signaling completion or not, etc.) and their role in achieving integration into a larger system. In [436], Cardoso et al. dedicate Chapter 7 to embedded heterogeneous computing platforms. They discuss code retargeting to leverage hardware accelerators (mainly GPUs and FPGAs).

Hardware acceleration is taking momentum in data-centers too. Cloud applications like big-data and AI (in particular, training of massive NNs with millions of parameters) pose challenges that cannot be solved with general-purpose processors under stringent power budget requirements. A notable work that discusses the employment of accelerators in data centers is a book by Kachris et al. [394]. Here, the authors present many related research projects, the integration of programmable resources into the cloud and edge infrastructure, or a study of FPGA deployment as computing nodes, which is also the focus of [294].

6.3. Reconfigurable accelerators

We extensively discussed reconfigurable architectures, that currently include FPGAs, CGRAs, and accelerators that employ reconfigurability in a minor form (e.g., DRISA [186], FPSA [192], Micron Automata Processor [231]).

In [429], Chattopadhyay surveys reconfigurable architectures. He analyzes their evolution, classifies them based on their structure and their application domain, discusses tools and proposes future challenges. In [430], Tessier et al. discuss FPGAs and CGRAs. They provide a brief history and many relevant aspects such as usage, energy, emulation, languages, tools, runtime reconfiguration, and applications. In [431], DeHon discusses many aspects of reconfigurable architectures, such as energy, data/instruction locality, wiring, and area, with a particular attention to their mathematical modeling. In [432], Wijtvliet et al. propose a definition of CGRA and describe 36 architectures proposed in the previous 25 years (1991–2016). Then, they present a multi-level taxonomy and classify all the discussed examples according to it. Their work is continued and expanded by Liu et al. [18], which present a comprehensive survey on CGRAs. They define CGRAs and propose a taxonomy based on their programming, computation, and execution models. Then, they analyze related challenges and discuss CGRAs' architecture and application development in-depth. Skliarova and Sklyarov present in their book various topics related to FPGA-based accelerators [292]. In the first two chapters, they introduce reconfigurable devices and their tools, and the architectures of FPGA-based accelerators. Chapters 3 to 5 focus on the employment of accelerators in selected domains, like data search, data sort, etc. In Chapter 6, they explore some hardware–software co-design techniques. We already mentioned [297,298,425] that discuss AI accelerators implemented on FPGAs.

6.4. PIM accelerators

The recent diffusion of in-memory accelerators inspired some surveys on the matter. Fujiki et al. published a book about in- and near-memory computing [23]. Chapter 6 focuses on accelerators that leverage this design principle. They focus on ML, automata, graph, database, and genomics accelerators, describing in-depth ISAAC [229] and Neural Cache [443] for ML workloads and Micron Automata Processor [231–235] for automata acceleration. Chapter 7 focuses on programming models for this class of accelerators. In [433], Mittal et al. survey SRAM-based in-memory-computing techniques, focusing on ML and Automata accelerators.

6.5. Accelerators' specific facets

At this point, the readers should be familiar with the vastness of the accelerator topic. Many facets are involved, and there are works that focus on one aspect in particular. To name a few, in [22], Buchty et al. present a survey of heterogeneous programming techniques on multicores and accelerators. In [437], Hawick and Playne study the software ecosystem of hardware accelerators and the limited opportunities to achieve interoperability between accelerators. They discuss developmental directions to improve this situation. In [438], Cota et al. study and evaluate the effects of coupling techniques between integrated accelerators and other on-chip components, such as cores and caches. In [439], Verbanescu and Shen present various techniques
to leverage host and accelerator together in heterogeneous code, as opposed to the offload model. In [440], Margerm et al. present TAPAS, an HLS toolchain to generate accelerators from programs with dynamic parallelism, supporting also heterogeneous, nested, and recursive parallelism. In [441], Addazi et al. advocate for the data-parallel Action Language for Foundational UML (Alf) to express computation-intensive parts of an application, so as to eliminate architecture-specific code and increase safety in accelerated embedded scenarios.

7. Conclusions

In this survey, we discussed accelerators in-depth, adopting a holistic perspective. We presented the many aspects that constitute their design space by proposing a taxonomy organized in fourteen categories, grouped in four macro-categories: general aspects, host coupling, architecture, and software aspects. We categorized about 100 accelerators from the last decade according to it, taking into consideration both commercial products and academic proposals. Some general considerations emerged from the overall categorization:

• Academia is investing a lot of effort in PIM accelerator research, mostly used for ML, Graph processing, and Genome sequence alignment tasks. Memristors are the preferred in-memory approach, while those based on 3D-stacked technologies are common near-memory solutions. In the latter case, general-purpose processors, simple logic, or even complex structures like arrays of PEs and tensor cores have been successfully integrated in the logic die.
• Industry adopts a less experimental strategy, preferring more traditional options like Manycores and GPUs, but Spatial and Systolic designs are gaining increasing attention. The former are suited to classic workloads (e.g., data-parallel), while the latter are fruitfully employed in ML. These have generally rich ecosystems, with GPUs having the richest.
• CGRAs are another frontier of academic research. They are a new class of reconfigurable accelerators suited for Dataflow, but also ML and even Data-parallel tasks.
• ML is by far the most popular domain, especially CNNs/DNNs. Inference is targeted three times more often than training, and basically every accelerator type has been employed to accelerate this task.
• Host coupling is usually overlooked by academic works. When this does not happen, PCI-e and on-die integration are the preferred connection strategies, even if the diffusion of PIM solutions is making integration on the HMC logic die just as common. This determines a prevalence of DRAM as the preferred level of the memory hierarchy shared with the CPU, surpassing even the LLC.
• Most accelerators are programmable, with the instructions destined, in half of the cases, to special-purpose computing resources such as spatial arrays and vector processors. Despite this, fixed-function hardware blocks are also used, especially for ML-oriented activation functions. Ad-hoc programming languages are a surprisingly common way to program accelerators, providing at least a function-level granularity that guarantees high flexibility.
• TensorFlow stands out as the most popular framework/library among accelerators, while OpenCL is the most heterogeneous, in the sense that it can run on the highest variety of accelerators, including Manycores, GPUs, PIMs, and FPGAs.

Then, we discussed the aggregated throughput and efficiency of various accelerator models in the Data-parallel and ML domains, which are those that usually disclose these data. A general trend towards reduced precision emerges from both, due to the drive of ML, which does not need high precision in most cases, especially for inference tasks. ML accelerators use data types as small as 1 bit, albeit in very-low-power accelerators. For data-parallel accelerators, throughput is firmly in the hands of the Manycore and GPU types, while efficiency is contended between Manycores and CGRAs. ML accelerators show the advantage of specialization, as testified by the far higher efficiency that highly tailored architectures achieve with respect to more general ones, not limited to ML applications. This is true even comparing specialized architectures of about a decade ago with the most recent data-parallel GPUs with roughly the same power budget. Also, in the ML domain there is no neat distinction between accelerator types from the perspective of throughput and efficiency capabilities, as any type can prove competitive.

If the first part of the paper analyzed accelerators with a vertical approach, the second part concentrated more on a horizontal approach, with a discussion of open challenges that pertain to virtually every accelerator. For each open challenge, we analyzed proposed state-of-the-art solutions and provided our prospective directions for future research. The major points can be summarized as follows:

• We expect the research to continue improving technological aspects of memory-rich processors and PIM accelerators, which are common solutions to the memory wall problem. We argue that promising programming solutions should address their similarities with distributed systems, and we argue that silicon photonics has the potential to be a key ingredient in their evolution.
• We argue that a Hybrid Reconfigurable/Programmable Accelerator (HRPA) has the potential to be a promising approach to exploit reconfigurability capabilities beyond FPGAs, adopting the advantages of the CGRA design while improving at least its programmability, which is the main obstacle to CGRAs' widespread adoption.
• We argue that the emerging design trend promoting self-similar accelerator architectures (i.e., accelerators with specific accelerating modules within themselves) will consolidate further and should take cooperation between different modules as a first-class design criterion, so as to allow a seamless integration of different modules in complex applications.
• We expect that the variety of accelerators and applications will not allow a single solution to the shared need for a unified coherent Virtual Memory space. We expect that, in the future, the support for various scenarios will be pursued both in hardware, with different translation strategies, and in software, with a possible standardization of allocation primitives supporting different caching strategies.
• We argue that virtualization, which was historically addressed with accelerator-specific solutions, should be pursued with a more general approach, giving important advantages to both designers and users. We argue that a full-software strategy is not convenient, and that solutions based on hardware modules supporting accelerators with different levels of complexity should be investigated instead.
• We observe that both data-parallel and ML programming ecosystems are moving towards standardization, with few dominating proposals. We argue that novel accelerators addressing these domains should continue the trend of supporting well-established frameworks, while accelerators in less-standardized domains should pursue efficiency and productivity above all. We argue that the research in this area should concentrate on a comprehensive, multi-domain framework, able to target multi-accelerator systems, which may be common in the future, as homogeneous platforms, relying on advanced compilers and runtimes to identify individual acceleratable tasks and manage their targeting.

Overall, the current state-of-the-art and the perspectives depicted above underline a dramatic and strategic need for standardization in the different considered directions, so that accelerated architectures will be as naturally employable and manageable as general-purpose ones. In some directions, this process has already started (e.g., data-parallel and ML ecosystems), in others the first steps are being done and tested (e.g., the Unified Virtual Memory initiative), while in others no concrete standardization progress has been made yet (e.g., programming and program orchestration support in PIM architectures).
Finally, we presented some related work that shares our intention of [18] L. Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, S. Wei, A survey of
addressing accelerators in general, albeit the majority focus on subsets coarse-grained reconfigurable architecture and design: Taxonomy, challenges,
and applications, ACM Comput. Surv. 52 (6) (2019).
like single domains or particular accelerator types.
[19] Y. Xue, P. Cronin, C. Yang, J. Hu, Non-volatile memories in FPGAs: Exploiting
Overall, this work discussed accelerators from various points of logic similarity to accelerate reconfiguration and increase programming cycles,
view, both vertically, concentrating on the characteristics of single fam- in: 2015 IFIP/IEEE International Conference on Very Large Scale Integration,
ilies and models, and horizontally, discussing topics that involve accel- VLSI-SoC, 2015, pp. 92–97.
erators as a whole. It can serve both as an introduction to professionals [20] Y. Chen, J. Emer, V. Sze, Eyeriss: A spatial architecture for energy-efficient
dataflow for convolutional neural networks, in: 2016 ACM/IEEE 43rd Annual
that want to approach the accelerator world, and as a comprehensive
International Symposium on Computer Architecture, ISCA, 2016, pp. 367–379.
analysis of the current state of accelerator research, with prospective [21] V. Sze, Y. Chen, T. Yang, J.S. Emer, Efficient processing of deep neural
directions for the future. networks: A tutorial and survey, Proc. IEEE 105 (12) (2017) 2295–2329.
[22] R. Buchty, V. Heuveline, W. Karl, J.-P. Weiss, A survey on hardware-aware and
Declaration of competing interest heterogeneous computing on multicore processors and accelerators, Concurr.
Comput.: Pract. Exper. 24 (7) (2012) 663–675.
[23] D. Fujiki, X. Wang, A. Subramaniyan, R. Das, In-/Near-Memory Computing,
One or more of the authors of this paper have disclosed potential or 2021, pp. 1–124.
pertinent conflicts of interest, which may include receipt of payment, [24] S. Dave, Y. Kim, S. Avancha, K. Lee, A. Shrivastava, DMazeRunner: Executing
either direct or indirect, institutional support, or association with an perfectly nested loops on dataflow accelerators, ACM Trans. Embed. Comput.
Syst. 18 (5s) (2019).
entity in the biomedical field which may be perceived to have potential
[25] A. Munshi, B. Gaster, T.G. Mattson, J. Fung, D. Ginsburg, OpenCL Programming
conflict of interest with this work. For full disclosure statements refer Guide, first ed., Addison-Wesley Professional, 2011.
to https://doi.org/10.1016/j.sysarc.2022.102561. [26] J.B. Dennis, D.P. Misunas, A computer architecture for highly parallel signal
Biagio Peccerillo reports financial support was provided by Huawei processing, in: Proceedings of the 1974 Annual ACM Conference - Volume 2,
Technologies Research & Development (UK) Limited. Co-author Andrea ACM ’74, Association for Computing Machinery, New York, NY, USA, 1974, pp.
402–409.
Mondelli is currently employed by Huawei Technologies Research &
[27] J.B. Dennis, D.P. Misunas, A preliminary architecture for a basic data-flow
Development (UK) Limited. processor, in: Proceedings of the 2nd Annual Symposium on Computer Archi-
tecture, ISCA ’75, Association for Computing Machinery, New York, NY, USA,
Acknowledgments 1974, pp. 126–132.
[28] J.B. Dennis, First version of a data flow procedure language, in: B. Robi-
net (Ed.), Programming Symposium, Springer, Berlin, Heidelberg, 1974, pp.
This research has been funded by Huawei Technologies Research &
362–376.
Development (UK) Limited. [29] B. Furht, Handbook of Augmented Reality, Springer Publishing Company,
Incorporated, 2011.
References [30] T. Huang, Computer vision: Evolution and promise, 1996, URL https://cds.cern.
ch/record/400313.
[31] R.L. Rivest, Cryptography, computers in, in: Encyclopedia of Computer Science,
[1] W. Haensch, E.J. Nowak, R.H. Dennard, P.M. Solomon, A. Bryant, O.H.
John Wiley and Sons Ltd., GBR, 2003, pp. 468–470.
Dokumaci, A. Kumar, X. Wang, J.B. Johnson, M.V. Fischetti, Silicon CMOS
[32] Oracle, What is a database? 2022, URL https://www.oracle.com/database/
devices beyond scaling, IBM J. Res. Dev. 50 (4.5) (2006) 339–361.
what-is-database/.
[2] M. Bohr, A 30 year retrospective on Dennard’s MOSFET scaling paper, IEEE
[33] Y. Turakhia, G. Bejerano, W.J. Dally, Darwin: A Genomics co-processor provides
Solid-State Circuits Soc. Newslett. 12 (1) (2007) 11–13.
up to 15,000X acceleration on long read assembly, in: Proceedings of the
[3] D.A. Patterson, J.L. Hennessy, Computer Organization and Design RISC-V
Twenty-Third International Conference on Architectural Support for Program-
Edition: the Hardware Software Interface, Elsevier Science, 2017.
ming Languages and Operating Systems, ASPLOS ’18, Association for Computing
[4] J.L. Hennessy, D.A. Patterson, A new golden age for computer architecture,
Machinery, New York, NY, USA, 2018, pp. 199–213.
Commun. ACM 62 (2) (2019) 48–60.
[34] J.F. Hughes, A. van Dam, M. McGuire, D.F. Sklar, J.D. Foley, S.K. Feiner, K.
[5] H. Esmaeilzadeh, E. Blem, R.S. Amant, K. Sankaralingam, D. Burger, Dark
Akeley, Computer Graphics: Principles and Practice, third ed., Addison-Wesley
silicon and the end of multicore scaling, in: 2011 38th Annual International
Professional, 2014.
Symposium on Computer Architecture, ISCA, 2011, pp. 365–376. [35] D. Lee, M. Yannakakis, Principles and methods of testing finite state machines-A
[6] G.M. Amdahl, Validity of the single processor approach to achieving large scale survey, Proc. IEEE 84 (8) (1996) 1090–1123.
computing capabilities, Reprinted from the AFIPS conference proceedings, Vol. [36] A.M. Caulfield, E.S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman,
30 (Atlantic City, N.J., Apr. 18-20), AFIPS press, Reston, va., 1967, pp. 483-485, S. Heil, M. Humphrey, P. Kaur, J.-Y. Kim, D. Lo, T. Massengill, K. Ovtcharov,
IEEE Solid-State Circuits Soc. Newslett. 12 (3) (2007) 19–20. M. Papamichael, L. Woods, S. Lanka, D. Chiou, D. Burger, A cloud-scale accel-
[7] J.L. Hennessy, D.A. Patterson, Computer Architecture, Sixth Edition: A Quan- eration architecture, in: 2016 49th Annual IEEE/ACM International Symposium
titative Approach, sixth ed., Morgan Kaufmann Publishers Inc., San Francisco, on Microarchitecture, MICRO, IEEE, 2016, pp. 1–13.
CA, USA, 2017. [37] S.-W. Hwang, S. Kim, Y. He, S. Elnikety, S. Choi, Prediction and predictability
[8] M. Zahran, Heterogeneous computing: Here to stay, Queue 14 (6) (2016) 31–42. for search query acceleration, ACM Trans. Web 10 (3) (2016).
[9] S. Patel, W.-M. Hwu, Accelerator architectures, IEEE Micro 28 (2008) 4–12. [38] S. Karandikar, C. Leary, C. Kennelly, J. Zhao, D. Parimi, B. Nikolic, K. Asanovic,
[10] T. Nowatzki, V. Gangadhar, N. Ardalani, K. Sankaralingam, Stream-dataflow P. Ranganathan, A hardware accelerator for protocol buffers, in: MICRO-54:
acceleration, in: 2017 ACM/IEEE 44th Annual International Symposium on 54th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO
Computer Architecture, ISCA, 2017, pp. 416–429. ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp.
[11] G. Pfister, Why accelerators now? 2009, http://perilsofparallel.blogspot.com/ 462–478.
2009/07/why-accelerators-now.html. [39] S. Gong, J. Li, W. Lu, G. Yan, X. Li, ShuntFlow: An efficient and scalable
[12] W.J. Dally, Y. Turakhia, S. Han, Domain-specific hardware accelerators, dataflow accelerator architecture for streaming applications, in: 2019 56th
Commun. ACM 63 (7) (2020). ACM/IEEE Design Automation Conference, DAC, 2019, pp. 1–6.
[13] S.W. Keckler, W.J. Dally, B. Khailany, M. Garland, D. Glasco, GPUs and the [40] I. Stamoulias, M. Möller, R. Miedema, C. Strydis, C. Kachris, D. Soudris, High-
future of parallel computing, IEEE Micro 31 (5) (2011) 7–17. performance hardware accelerators for solving ordinary differential equations,
[14] Intel, Intel stratix 10 FPGAs & SoC FPGA, www.intel.com/content/www/us/en/ in: Proceedings of the 8th International Symposium on Highly Efficient Acceler-
products/details/fpga/stratix/10.html. ators and Reconfigurable Technologies, HEART2017, Association for Computing
[15] X. Li, T. Li, ECOMIPS: An economic MIPS CPU design on FPGA, in: 4th IEEE Machinery, New York, NY, USA, 2017.
International Workshop on System-on-Chip for Real-Time Applications, 2004, [41] J. Kung, Y. Long, D. Kim, S. Mukhopadhyay, A programmable hardware
pp. 291–294. accelerator for simulating dynamical systems, ACM ACM SIGARCH Comput.
[16] S. Druva Kumar, P. Sharma, K. Prajwal Shenoy, S.S. Naik, A.S. Lewis, Imple- Archit. News 45 (2) (2017) 403–415.
mentation of 16-bit hack CPU on FPGA, in: 2020 4th International Conference [42] G.A. Gillani, A. Krapukhin, A.B.J. Kokkeler, Energy-efficient approximate least
on Intelligent Computing and Control Systems, ICICCS, 2020, pp. 555–559. squares accelerator: A case study of radio astronomy calibration processing, in:
[17] K. Papadimitriou, A. Dollas, S. Hauck, Performance of partial reconfiguration in Proceedings of the 16th ACM International Conference on Computing Frontiers,
FPGA systems: A survey and a cost model, ACM Trans. Reconfigurable Technol. CF ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp.
Syst. 4 (4) (2011). 358–365.

43
B. Peccerillo et al. Journal of Systems Architecture 129 (2022) 102561

[43] Y. Huang, N. Guo, M. Seok, Y. Tsividis, S. Sethumadhavan, Evaluation of an [77] TechPowerUp, AMD Radeon RX 5700 XT, 2019, https://www.techpowerup.
analog accelerator for linear algebra, ACM SIGARCH Comput. Archit. News 44 com/gpu-specs/radeon-rx-5700-xt.c3339.
(3) (2016) 570–582. [78] AMD, AMD RDNA 2, 2021, https://www.amd.com/en/technologies/rdna-2.
[44] L. Duch, S. Basu, M. Peón-Quirós, G. Ansaloni, L. Pozzi, D. Atienza, I-DPs CGRA: [79] TechPowerUp, AMD Radeon RX 6800 XT, 2020, https://www.techpowerup.
An interleaved-datapaths reconfigurable accelerator for embedded bio-signal com/gpu-specs/radeon-rx-6800-xt.c3694.
processing, IEEE Embed. Syst. Lett. 11 (2) (2019) 50–53. [80] TechPowerUp, AMD Radeon RX 6900 XT, 2020, https://www.techpowerup.
[45] R. Taranco, J.-M. Arnau, A. González, A low-power hardware accelerator for com/gpu-specs/radeon-rx-6900-xt.c3481.
B. Peccerillo et al. Journal of Systems Architecture 129 (2022) 102561

[326] Graphcore, Poplar graph framework software, https://www.graphcore.ai/produ [356] M. Steuwer, P. Kegel, S. Gorlatch, SkelCL - A portable skeleton library for
cts/poplar. high-level GPU programming, in: Proceedings of the 2011 IEEE International
[327] N. Sundaram, N. Satish, M.M.A. Patwary, S.R. Dulloor, M.J. Anderson, S.G. Symposium on Parallel and Distributed Processing Workshops and PhD Fo-
Vadlamudi, D. Das, P. Dubey, GraphMat: High performance graph analytics rum, IPDPSW ’11, IEEE Computer Society, Washington, DC, USA, 2011, pp.
made productive, Proc. VLDB Endow. 8 (11) (2015) 1214–1225. 1176–1182.
[328] Hailo, Dataflow compiler - A complete & scalable software toolchain, https:// [357] M. Woo, J. Neider, T. Davis, D. Shreiner, OpenGL Programming Guide: The
hailo.ai/product-hailo/hailo-dataflow-compiler/. Official Guide to Learning OpenGL, Version 1.2, Addison-Wesley Longman
[329] Huawei, CANN chip enablement - Improving development efficiency to better Publishing Co., Inc., 1999.
match the Ascend chip enablement, https://e.huawei.com/en/products/cloud-c [358] AMD, AMD And microsoft DirectX 12, 2021, https://www.amd.com/en/techno
omputing-dc/atlas/cann. logies/directx12.
[330] MindSpore, Programming guide, 2021, https://www.mindspore.cn/docs/progra [359] F. Chollet, et al., Keras, 2015, https://keras.io.
mming_guide/en/master/index.html. [360] B. Peccerillo, S. Bartolini, PHAST - A Portable high-level modern C++ program-
[331] Huawei, MindX SDK, https://support.huaweicloud.com/intl/en-us/mindxsdk/. ming library for GPUs and multi-cores, IEEE Trans. Parallel Distrib. Syst. 30 (1)
[332] Google, Neural networks API, 2021, https://developer.android.com/ndk/guide (2019) 174–189.
s/neuralnetworks. [361] SiSoft, Intel arc alchemist graphics card lineup detailed, 2022, https://www.te
[333] Intel, DSP builder for intel FPGAs, www.intel.com/content/www/us/en/softwa chpowerup.com/292252/intel-arc-alchemist-graphics-card-lineup-detailed.
re/programmable/quartus-prime/dsp-builder.html. [362] Intel, Intel stratix 10NX FPGAs, https://www.intel.it/content/www/it/it/produ
[334] Intel, Intel high level synthesis compiler, https://www.intel.com/content/www cts/details/fpga/stratix/10/nx.html.
/us/en/software/programmable/quartus-prime/hls-compiler.html. [363] S.A. McKee, R.W. Wisniewski, Memory wall, in: D. Padua (Ed.), Encyclopedia
[335] Xilinx, Xilinx design flow for intel FPGA and SoC users, 2018, https://www.xili of Parallel Computing, Springer US, Boston, MA, 2011, pp. 1110–1116.
nx.com/support/documentation/sw_manuals/ug1192-xilinx-design-for-intel.pdf. [364] G. Bonshor, AMD releases milan-X CPUs with 3D V-cache: EPYC 7003 up to
[336] Intel, Supported APIs for intel graphics controllers, 2021, https://www.intel.co 64 cores and 768 MB L3 cache, 2022, https://www.anandtech.com/show/173
m/content/www/us/en/support/articles/000005524/graphics.html. 23/amd-releases-milan-x-cpus-with-3d-vcache-epyc-7003.
[337] Apache Incubator, A flexible and efficient library for deep learning, 2018, htt [365] H.S. Stone, A logic-in-memory computer, IEEE Trans. Comput. C-19 (1) (1970)
ps://mxnet.apache.org/versions/1.9.0/. 73–78.
[338] N. Rotem, J. Fix, S. Abdulrasool, S. Deng, R. Dzhabarov, J. Hegeman, R. [366] P. Siegl, R. Buchty, M. Berekovic, Data-centric computing frontiers: A survey on
Levenstein, B. Maher, N. Satish, J. Olesen, J. Park, A. Rakhov, M. Smelyanskiy, processing-in-memory, in: Proceedings of the Second International Symposium
Glow: Graph lowering compiler techniques for neural networks, 2018, CoRR, a on Memory Systems, MEMSYS ’16, Association for Computing Machinery, New
rXiv:1805.00907. York, NY, USA, 2016, pp. 295–308.
[339] Intel, OpenVINO - Deploy high-performance deep learning inference, [367] F. Gao, G. Tziantzioulis, D. Wentzlaff, ComputeDRAM: In-memory compute
2021, https://software.intel.com/content/www/us/en/develop/tools/openvino- using off-the-shelf DRAMs, in: Proceedings of the 52nd Annual IEEE/ACM
toolkit.html. International Symposium on Microarchitecture, MICRO ’52, Association for
[340] Intel, nGraph, https://www.intel.com/content/www/us/en/artificial-intelligenc Computing Machinery, New York, NY, USA, 2019, pp. 100–113.
e/ngraph.html. [368] X. Xin, Y. Zhang, J. Yang, ROC: DRAM-based processing with reduced operation
[341] S. Cyphers, A.K. Bansal, A. Bhiwandiwalla, J. Bobba, M. Brookhart, A. cycles, in: Proceedings of the 56th Annual Design Automation Conference 2019,
Chakraborty, W. Constable, C. Convey, L. Cook, O. Kanawi, R. Kimball, J. DAC ’19, Association for Computing Machinery, New York, NY, USA, 2019.
Knight, N. Korovaiko, V. Kumar, Y. Lao, C.R. Lishka, J. Menon, J. Myers, S.A. [369] A.B. Yoo, M.A. Jette, M. Grondona, SLURM: Simple linux utility for resource
Narayana, A. Procter, T.J. Webb, Intel nGraph: An intermediate representation, management, in: D. Feitelson, L. Rudolph, U. Schwiegelshohn (Eds.), Job
compiler, and executor for deep learning, 2018, arXiv preprintarXiv:1801.080 Scheduling Strategies for Parallel Processing, Springer Berlin Heidelberg, Berlin,
58. Heidelberg, 2003, pp. 44–60.
[342] K. Angstadt, W. Weimer, K. Skadron, RAPID Programming of pattern- [370] K. Hightower, B. Burns, J. Beda, Kubernetes: Up and Running Dive Into the
recognition processors, in: Proceedings of the Twenty-First International Future of Infrastructure, first ed., O’Reilly Media, Inc., 2017.
Conference on Architectural Support for Programming Languages and Operating [371] A. García-Guirado, R. Fernández-Pascual, J.M. García, S. Bartolini, Managing
Systems, ASPLOS ’16, Association for Computing Machinery, New York, NY, resources dynamically in hybrid photonic-electronic networks-on-chip, Concurr.
USA, 2016, pp. 593–605. Comput.: Pract. Exper. 26 (15) (2014) 2530–2550.
[343] Synario, VHDL Reference manual, 1997, https://www.ics.uci.edu/~jmoorkan/ [372] HP, The machine: A new kind of computer, https://www.hpl.hp.com/research
vhdlref/Synario VHDL Manual.pdf. /systems-research/themachine/.
[344] NEC Corporation, NEC SX-Aurora TSUBASA compilers, libraries and tools, 2021, [373] S. Bartolini, L. Benini, K. Bertels, S. Blanas, U. Brinkschulte, P. Carpenter,
www.nec.com/en/global/solutions/hpc/sx/tools.html?. G. De Micheli, M. Duranton, B. Falsafi, D. Fey, S. Hamdioui, C. Hochberger,
[345] J. Sanders, E. Kandrot, CUDA by Example: An Introduction to General-Purpose A. Mendelson, D. Meyer, I. Polian, U. Rückert, X. Salazar, W. Schindler, P.
GPU Programming, first ed., Addison-Wesley Professional, 2010. Stenstrom, T. Ungerer, Eurolab4HPC long-term vision on high-performance
[346] MathWorks, MATLAB GPU computing support for NVIDIA CUDA-enabled GPUs, computing, in: T. Ungerer, P. Carpenter (Eds.), Fundamentals and Standards
2021, https://www.mathworks.com/solutions/gpu-computing.html. in Hardware Description Languages, second ed., 2020.
[347] NVIDIA, DirectX 12, 2021, https://www.nvidia.com/en-us/geforce/technologie [374] Optalysys, The multiply and Fourier transform unit: A micro-scale optical
s/dx12/. processor, 2020, https://optalysys.com/s/Multiply_and_Fourier_Transform_white
[348] D. Koeplinger, R. Prabhakar, Y. Zhang, C. Delimitrou, C. Kozyrakis, K. Olukotun, _paper_12_12_20.pdf.
Automatic generation of efficient accelerators for reconfigurable hardware, [375] J. Cong, H. Huang, C. Ma, B. Xiao, P. Zhou, A fully pipelined and dynamically
in: 2016 ACM/IEEE 43rd Annual International Symposium on Computer composable architecture of CGRA, in: 2014 IEEE 22nd Annual International
Architecture, ISCA, 2016, pp. 115–127. Symposium on Field-Programmable Custom Computing Machines, 2014, pp.
[349] J. Talbot, R.M. Yoo, C. Kozyrakis, Phoenix++: Modular MapReduce for shared- 9–16.
memory systems, in: Proceedings of the Second International Workshop on [376] IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019 (Revision of
MapReduce and Its Applications, MapReduce ’11, Association for Computing IEEE 754-2008), 2019, pp. 1–84.
Machinery, New York, NY, USA, 2011, pp. 9–16. [377] C. Nicol, A coarse grain reconfigurable array (CGRA) for statically scheduled
[350] M. Innes, E. Saba, K. Fischer, D. Gandhi, M.C. Rudilosso, N.M. Joy, T. Karmali, data flow computing, 2017, http://www.silicon-ukraine.com/public_materials/
A. Pal, V. Shah, Fashionable modelling with flux, 2018, CoRR, arXiv:1811.014 2017_10_08_msu_rountable/background/CGRA+Whitepaper.pdf.
57. [378] A.A.A. Donovan, B.W. Kernighan, The Go Programming Language, first ed.,
[351] M. Innes, Flux: Elegant machine learning with Julia, J. Open Source Softw. Addison-Wesley Professional, 2015.
(2018). [379] NVIDIA, NVIDIA TEsla V100 GPU architecture, 2017, https://images.nvidia.co
[352] Xilinx, Vivado design suite - HLx edition, 2021, https://www.xilinx.com/prod m/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
ucts/design-tools/vivado.html. [380] S.M. Trimberger, Three ages of FPGAs: A retrospective on the first thirty years
[353] Xilinx, Verilog reference guide, 1999, http://in.ncu.edu.tw/ncume_ee/digilogi/ of FPGA technology, Proc. IEEE 103 (3) (2015) 318–331.
vhdl/Verilog_Reference_Guide.pdf. [381] HSA Foundation, HSA Foundation, 2017, URL http://www.hsafoundation.com
[354] MathWorks, Xilinx FPGAs and Zynq SoCs - Model, verify, and program your /.
algorithms on xilinx devices, 2021, https://www.mathworks.com/solutions/fpg [382] CXL Consortium, Compute express link: The breakthrough CPU-to-device
a-asic-soc-development/xilinx.html. interconnect, 2022, https://www.computeexpresslink.org/.
[355] H.C. Edwards, C.R. Trott, Kokkos: Enabling performance portability across [383] Y. Hao, Z. Fang, G. Reinman, J. Cong, Supporting address translation for
manycore architectures, in: 2013 Extreme Scaling Workshop, XSW 2013, 2013, accelerator-centric architectures, in: 2017 IEEE International Symposium on
pp. 18–24. High Performance Computer Architecture, HPCA, 2017, pp. 37–48.
