

A High-Performance, Energy-Efficient Modular DMA Engine Architecture

Thomas Benz, Graduate Student Member, IEEE, Michael Rogenmoser, Graduate Student Member, IEEE, Paul Scheffler, Graduate Student Member, IEEE, Samuel Riedel, Member, IEEE, Alessandro Ottaviano, Graduate Student Member, IEEE, Andreas Kurth, Member, IEEE, Torsten Hoefler, Fellow, IEEE, and Luca Benini, Fellow, IEEE

Abstract—Data transfers are essential in today's computing systems as latency and complex memory access patterns are increasingly challenging to manage. Direct memory access engines (DMAEs) are critically needed to transfer data independently of the processing elements, hiding latency and achieving high throughput even for complex access patterns to high-latency memory. With the prevalence of heterogeneous systems, DMAEs must operate efficiently in increasingly diverse environments. This work proposes a modular and highly configurable open-source DMAE architecture called intelligent DMA (iDMA), split into three parts that can be composed and customized independently. The front-end implements the control plane binding to the surrounding system. The mid-end accelerates complex data transfer patterns such as multi-dimensional transfers, scattering, or gathering. The back-end interfaces with the on-chip communication fabric (data plane). We assess the efficiency of iDMA in various instantiations: in high-performance systems, we achieve speedups of up to 15.8× with only 1 % additional area compared to a base system without a DMAE. We achieve an area reduction of 10 % while improving ML inference performance by 23 % in ultra-low-energy edge AI systems over an existing DMAE solution. We provide area, timing, latency, and performance characterization to guide its instantiation in various systems.

Index Terms—DMA, DMAC, direct memory access, memory systems, high-performance, energy-efficiency, edge AI, AXI, TileLink.

Manuscript received 6 April 2023; revised 4 October 2023; accepted 29 October 2023. Date of publication 7 November 2023; date of current version 22 December 2023. This work was supported in part by the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement 800928 and Specific Grant Agreements 101036168 (EPI SGA2) and 101034126 (The EU Pilot). Recommended for acceptance by T. Adegbija. (Corresponding author: Thomas Benz.)

Thomas Benz, Michael Rogenmoser, Paul Scheffler, Samuel Riedel, Alessandro Ottaviano, and Andreas Kurth are with the Integrated Systems Laboratory (IIS), ETH Zurich, 8092 Zürich, Switzerland.

Torsten Hoefler is with the Scalable Parallel Computing Laboratory (SPCL), ETH Zurich, 8092 Zürich, Switzerland.

Luca Benini is with the Integrated Systems Laboratory (IIS), ETH Zurich, 8092 Zürich, Switzerland, and also with the Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, 40126 Bologna, Italy.

Digital Object Identifier 10.1109/TC.2023.3329930

I. INTRODUCTION

Direct memory access engines (DMAEs) form the communication backbone of many contemporary computers [1]. They concurrently move data at high throughput while hiding memory latency and minimizing processor load, freeing the latter to do useful compute. This function becomes increasingly critical with the trend towards physically larger systems [2] and ever-increasing memory bandwidths [3]. With Moore's Law slowing down, 2.5D and 3D integration are required to satisfy future applications' computational and memory needs, leading to wider and higher-bandwidth memory systems and longer access latencies [4], [5].

Without DMAEs, processing elements (PEs) need to read and write data from and to remote memory, often relying on deep cache hierarchies to mitigate performance and energy overheads. This paper focuses on explicitly managed memory hierarchies, where copies across the hierarchy are handled by DMAEs. We refer the interested reader to [6], [7], [8], [9] for excellent surveys on cache-based memory systems. Caches and DMAEs often coexist in modern computing systems as they address different application needs. Dedicated DMAEs are introduced to efficiently and autonomously move data for workloads where memory access is predictable, weakly data-dependent, and made in fairly large chunks, decoupling memory accesses from execution and helping maximize the PE time spent on useful compute.

When integrating DMAEs, three main design challenges must be tackled: the control-plane interface to the PEs, the intrinsic data movement capabilities of the engine, and the on-chip protocols supported in the data plane. The sheer number of DMAEs present in literature and available as commercial products explains why these choices are usually fixed at design time. The increased heterogeneity in today's accelerator-rich computing environments leads to even more diverse requirements for DMAEs. Different on-chip protocols, programming models, and application profiles lead to a large variety of different direct memory access (DMA) units used in modern systems on chip (SoCs), hindering integration and verification efforts.

We present a modular and highly parametric DMAE architecture called intelligent DMA (iDMA), which is composed of three distinct parts: the front-end handling PE interaction, the mid-end managing the engine's lower-level data movement capabilities, and the back-end implementing one or more on-chip protocol interfaces. We call concrete implementations of our iDMA architecture iDMAEs. All module boundaries are standardized to facilitate the substitution of individual parts, allowing the same DMAE architecture to be used across a wide range of systems and applications.


The synthesizable register transfer level (RTL) description of iDMA, silicon-proven in various instances, and the system bindings are available free and open-source under a libre Apache-based license¹.
In more detail, this paper makes the following contributions:
• We specify a modular, parametric DMAE architecture composed of interchangeable parts, allowing iDMA to accommodate and benefit any system.
• We optimize iDMA to minimize hardware buffering through a highly agile, read-write decoupled, dataflow-oriented transport engine that maximizes bus utilization in any context. Our architecture incurs no idle time between transactions, even when adapting between different on-chip protocols, and incurs only two cycles of initial latency to launch a multi-dimensional affine transfer.
• We propose and implement a two-stage transfer acceleration scheme: the mid-ends manage (distribute, repeat, and modify) transfers while an in-stream acceleration port enables configurable in-flight operation on the data being transferred.
• We present and implement multiple system bindings (front-ends) and industry-standard on-chip protocols (back-ends), allowing our engines to be used in a wide range of contexts, from ultra-low-power (ULP) to high-performance computing (HPC) systems. A lightweight data initialization feature allows iDMA to initialize memory given various data patterns.
• We thoroughly characterize our architecture in area, timing, and latency by creating area and timing models with less than 9 % mean error and an analytical latency model, easing instantiation in third-party designs and accelerating system prototyping.
• We use synthetic workloads to show that iDMA achieves high bus utilization in ultra-deep memory systems with affordable area growth. Our architecture perfectly hides latency in systems with memory hierarchies hundreds of stages deep. It reaches full bus utilization on transfers as small as 16 B while occupying an area footprint of less than 25 kGE² in a 32-b configuration.

Furthermore, we conduct five system integration studies implementing and evaluating iDMA in systems spanning a wide range of performance and complexity:
• On a minimal single-core, Linux-capable SoC [10] with reduced pin count DRAM, our iDMAE reaches speedups of up to 6× on fine-granular 64-B transfers over an off-the-shelf DMA while reducing field programmable gate array (FPGA) resource requirements by more than 10 %.
• In a ULP multicore edge node [11] for edge AI applications, we replace the cluster DMA with our iDMAE and improve the inference performance of MobileNetV1 from 7.9 MAC/cycle to 8.3 MAC/cycle while reducing DMAE area by 10 %.
• We add our iDMAE to a real-time system [12] and implement a dedicated mid-end autonomously launching repeated 3D transfer tasks to reduce core load. Our mid-end incurs an area penalty of 11 kGE, negligible compared to the surrounding system.
• In a scaled-out cluster manycore architecture [13], we integrate a novel distributed iDMAE [13] with multiple back-ends, accelerating integer workloads by 15.8× while increasing the area by less than 1 %.
• In a dual-chiplet multi-cluster manycore based on Manticore [14], adding an iDMAE to each compute cluster enables speedups of up to 1.5× and 8.4× on dense and sparse floating-point workloads, respectively, while incurring only 2.1 % in cluster area compared to a baseline architecture without a DMA.

¹ https://github.com/pulp-platform/iDMA for iDMA; system repositories in the same group (e.g., MemPool) for integrations.
² Gate equivalent (GE) is a technology-independent figure of merit measuring circuit complexity. A GE represents the area of a two-input, minimum-strength NAND gate.

Fig. 1. Schematic of iDMA: our engines are split into three parts: at least one front-end, one or multiple optional mid-ends, and at least one back-end.

II. ARCHITECTURE

Unlike state-of-the-art (SoA) DMAEs, we propose a modular and highly parametric DMAE architecture composed of three distinct parts, as shown in Fig. 1. The front-end defines the interface through which the processor cores control the DMAE, corresponding to the control plane. The back-end or data plane implements the on-chip network manager port(s) through which the DMAE moves data. The back-end supports both complex and capable on-chip protocols like the advanced eXtensible interface (AXI) [15], to move data through the system, and simpler core-local protocols like the open bus protocol (OBI) [16], to connect to PE-local memories. The mid-end, connecting the front- and back-end, slices complex transfer descriptors provided by the front-end (e.g., when transferring N-dimensional tensors) into one or multiple simple 1D transfer descriptors for the back-end to process. In addition, multiple mid-ends may be chained to enable complex transfer processing steps. Our ControlPULP case study in Section III-B shows this chaining mechanism by connecting a real-time and a 3D tensor mid-end to efficiently address that platform's needs.

To ensure compatibility between these three parts, we specify their interfaces. From the front-end or the last mid-end, the back-end accepts a 1D transfer descriptor specifying a source address, a destination address, the transfer length, the protocol, and back-end options, as seen in Fig. 2. Mid-ends receive bundles of mid-end configuration information and a 1D transfer descriptor. A mid-end will strip its configuration information while modifying the 1D transfer descriptor. All interfaces between front-, mid-, and back-ends feature ready-valid handshaking and can thus be pipelined.
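As a concrete illustration of the 1D descriptor handed to the back-end (cf. Fig. 2), the sketch below models it as a C structure. The field names and widths are illustrative assumptions only and do not reproduce the exact RTL encoding.

#include <stdint.h>

/* Illustrative model of the 1D transfer descriptor exchanged between
 * front-/mid-ends and the back-end (cf. Fig. 2). Field names and widths
 * are assumptions for illustration, not the exact hardware encoding. */
typedef struct {
    uint64_t src_addr;      /* source base address                  */
    uint64_t dst_addr;      /* destination base address             */
    uint64_t length;        /* transfer length in bytes             */
    uint8_t  src_protocol;  /* protocol port used for reads         */
    uint8_t  dst_protocol;  /* protocol port used for writes        */
    uint32_t backend_opts;  /* back-end options, e.g., burst limits */
} idma_1d_desc_t;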


Fig. 2. Outline of the 1D transfer descriptor (exchanged between mid- and back-end).

TABLE I
IDENTIFIERS AND DESCRIPTIONS OF FRONT-ENDS EMPLOYED IN THE USE CASES. FRONT-ENDS IN GRAY ARE AVAILABLE BUT NOT FURTHER DISCUSSED IN THIS WORK

Front-end      Description                                              Conf.
reg_32         Core-private register-based configuration                32-b, 1D
reg_32_2d      interface for ULP systems                                32-b, 2D
reg_32_3d                                                               32-b, 3D
reg_64                                                                  64-b, 1D
reg_64_2d                                                               64-b, 2D
reg_32_rt_3d   Core-private register-based system binding               32-b, 3D
               supporting our real-time mid-end
desc_64        Transfer-descriptor-based interface designed for         64-b, 1D
               64-b systems, compatible with the Linux DMA interface
inst_64        Interface decoding custom iDMA RISC-V instructions,      64-b, 2D
               used in HPC systems

TABLE II
IDENTIFIERS OF IMPLEMENTED MID-ENDS

Mid-end     Description
tensor_2D   Optimized to accelerate 2D transfers.
tensor_ND   Mid-end accelerating ND transfers.
mp_split    Mid-end splitting transfers along a parametric address boundary.
mp_dist     Mid-end distributing transfers over multiple back-ends.
rt_3D       Mid-end repetitively launching 3D transfers designed for real-time systems.
A. Front-End

We present three front-end types: a simple and area-efficient register-based scheme, an efficient microcode programming interface, and a high-performance descriptor-based front-end, as shown in Table I. Our selection is tailored to the current set of use cases; different front-ends can easily be created, e.g., allowing us to use our descriptor-based binding in 32-b systems.

Register-based: Core-private register-based configuration interfaces are the simplest front-ends. Each PE uses its own dedicated configuration space to eliminate race conditions while programming the DMAE [11]. We employ different memory-mapped register layouts depending on the host system's word width and whether a multi-dimensional tensor mid-end is present. The src_address, dst_address, transfer_length, status, configuration, and transfer_id registers are shared between all variants. In the case of a multi-dimensional configuration, every tensor dimension introduces three additional fields: src_stride, dst_stride, and num_repetitions. After configuring the shape of a transfer, it is launched by reading from transfer_id, which returns an incrementing unique transfer ID. The ID of the last completed transfer may be read from the status register, enabling transfer-level synchronization.
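To make this programming model concrete, the following sketch shows how a PE might configure and launch a 1D transfer through a register-based front-end. The register names follow the text above; the base address and register offsets are hypothetical placeholders, not the documented register map.

#include <stdint.h>

/* Hypothetical MMIO map of a core-private reg_32 front-end.
 * Offsets and the base address are placeholders for illustration. */
#define IDMA_BASE            0x10000000u
#define IDMA_SRC_ADDRESS     (*(volatile uint32_t *)(IDMA_BASE + 0x00))
#define IDMA_DST_ADDRESS     (*(volatile uint32_t *)(IDMA_BASE + 0x04))
#define IDMA_TRANSFER_LENGTH (*(volatile uint32_t *)(IDMA_BASE + 0x08))
#define IDMA_STATUS          (*(volatile uint32_t *)(IDMA_BASE + 0x0C))
#define IDMA_TRANSFER_ID     (*(volatile uint32_t *)(IDMA_BASE + 0x10))

/* Configure a 1D copy and launch it; returns the unique transfer ID. */
static uint32_t idma_copy_1d(uint32_t dst, uint32_t src, uint32_t len)
{
    IDMA_SRC_ADDRESS     = src;
    IDMA_DST_ADDRESS     = dst;
    IDMA_TRANSFER_LENGTH = len;
    /* Reading transfer_id launches the configured transfer. */
    return IDMA_TRANSFER_ID;
}

/* Transfer-level synchronization: spin until the given ID has completed.
 * The status register holds the ID of the last completed transfer. */
static void idma_wait(uint32_t id)
{
    while ((int32_t)(IDMA_STATUS - id) < 0) { /* spin */ }
}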
Descriptor-based: As the use of transfer descriptors is common practice in Linux-capable multicore systems [1], [17], we provide desc_64, a 64-b front-end compatible with the Linux DMA interface. Given a pointer, the front-end uses a dedicated manager port to fetch transfer descriptors from memory; currently, the AXI, AXI-Lite [15], and OBI [16] protocols are supported. The descriptors consist of a src_address, dst_address, transfer_length, and a run-time backend_configuration, corresponding to the information required for a 1D transfer. Descriptor chaining [1] is supported to allow efficient long and arbitrarily shaped transfers.

Instruction-based: We present inst_64, a front-end that can be tightly coupled to a RISC-V core, encoding iDMA transfers directly as instructions. For example, a Snitch [18] RISC-V core using inst_64 can launch a transaction within three cycles, enabling highly agile data transfers.

B. Mid-End

In iDMA, mid-ends process complex transfers coming from the front-end and decompose them into one or multiple 1D transfer(s), which can be handled directly by the back-end. An overview of our mid-ends can be found in Table II.

Tensor Mid-ends: To support multi-dimensional transfers, two distinct mid-ends are provided. Tensor_2D supports 2D transfers through an interface common in embedded systems, which requires the source and destination strides, the total length of the transfer, the base address, and the length of the 1D transfers it is composed of. Tensor_ND can be parameterized at compile time to support transfers on tensors of any dimension N. It is programmed by providing the source address, destination address, number of repetitions, and strides for each dimension.
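As an illustration of what a tensor mid-end does, the following behavioral sketch expands an N-dimensional affine transfer into the stream of 1D transfers consumed by the back-end. The descriptor layout, the four-dimension limit, and the issue_1d() callback are assumptions for illustration, not the hardware implementation.

#include <stdint.h>

/* Illustrative ND transfer description, limited to four outer
 * dimensions for brevity. */
typedef struct {
    uint64_t src_addr, dst_addr;
    uint64_t inner_len;                    /* length of each 1D transfer */
    uint32_t num_dims;                     /* number of outer dimensions */
    uint64_t num_reps[4];                  /* repetitions per dimension  */
    uint64_t src_stride[4], dst_stride[4]; /* strides per dimension      */
} idma_nd_desc_t;

/* Behavioral model of a tensor mid-end: walk all outer index
 * combinations and emit one 1D transfer per innermost row. */
static void expand_nd(const idma_nd_desc_t *nd,
                      void (*issue_1d)(uint64_t src, uint64_t dst, uint64_t len))
{
    uint64_t idx[4] = {0};
    for (;;) {
        uint64_t src = nd->src_addr, dst = nd->dst_addr;
        for (uint32_t i = 0; i < nd->num_dims; ++i) {
            src += idx[i] * nd->src_stride[i];
            dst += idx[i] * nd->dst_stride[i];
        }
        issue_1d(src, dst, nd->inner_len);   /* one 1D transfer */
        uint32_t i = 0;                      /* advance the index vector */
        while (i < nd->num_dims && ++idx[i] == nd->num_reps[i])
            idx[i++] = 0;
        if (i == nd->num_dims)
            break;
    }
}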
Distribution Mid-ends: In large manycore systems like MemPool [13], one centralized actor may want to schedule the data requests of multiple interconnect manager ports. We implement this functionality with a distributed multi-back-end iDMAE. To distribute work among back-ends, we create two specialized mid-ends called mp_split and mp_dist. Mp_split splits a single linear transfer into multiple transfers aligned to a parametric address boundary, guaranteeing that no resulting transfer crosses specific address boundaries, which is required when sending distributed transfers to multiple back-ends; see Section III-D. Mp_dist then distributes the split transfers over multiple parallel downstream mid- or back-ends, arbitrating the transfers based on their address offsets. Mp_dist's number of outgoing ports is set to two by default.

Real-time Mid-end: Repeated ND transfers are often required for data acquisition tasks such as reading out sensor arrays in real-time systems featuring complex address maps. The number of dimensions, N, can be set at compile time. To relieve the pressure on general-purpose cores already involved in task scheduling and computation, these transfers can be issued by iDMA using a specialized mid-end. The rt_3D mid-end enables a configurable number of repeated 3D transactions whose periodicity and transfer shape are configured via the front-end. A bypass mechanism allows the core to dispatch unrelated transfers using the same front- and back-end.


Fig. 3. The internal architecture of the back-end. The transport layer handles the actual copying of the data on the on-chip protocol, supported by the optional transfer legalizer and the error handler.

Fig. 4. The internal architecture of the transfer legalizer. Any given transfer can be legalized except for zero-length transactions: they may optionally be rejected.

C. Back-End

For given on-chip protocols, a back-end implements efficient in-order one-dimensional arbitrary-length transfers. Multichannel DMAEs, featuring multiple ports into the memory system, can be built by connecting multiple back-ends to a single front- or mid-end, requiring a similar area as a true multichannel back-end but less verification and design complexity. Arbitration between the individual back-ends can either be done explicitly by choosing the executing back-end through software or by an arbitration mid-end using round-robin or address-based distribution schemes. Similar hardware will be required in a true multichannel back-end to distribute transactions to available channels.

The back-end comprises three parts, see Fig. 3. The error handler communicates and reacts to failing transfers, the transfer legalizer reshapes incoming transfers to meet protocol requirements, and the transport layer handles data movement and possibly in-cycle switches between protocol-specific data plane ports. Of these three units, only the transport layer is mandatory.

Fig. 5. The architecture of the transport layer. One or multiple read manager(s) feed a stream of bytes into the source shifter, the dataflow element, the destination shifter, and finally, into one or multiple write manager(s). The highlighted block denotes the in-stream accelerator.
Error Handler: An error handler may be included if the system or application requires protocol error reporting or handling. Our current error handler can either continue, abort, or replay erroneous transfers. Replaying erroneous transfers allows complex ND transfers to continue in case of errors in single back-end iterations without the need to abort and restart the entire transfer.

When an error occurs, the back-end pauses the transfer processing and passes the offending transfer's legalized burst base address to its front-end. The PEs can then specify through the front-end which of the three possible actions the error handler should take to resolve the situation.
Transfer Legalizer: Shown in Fig. 4, it accepts a 1D transfer and legalizes it to be supported by the specific on-chip protocol(s) in use. Transfer information is stored internally, and modular legalizer cores determine the maximum legal transfer length supported given user constraints and the protocols' properties. For protocols that do not support bursts, the legalizer decomposes transfers into individual bus-sized accesses. Otherwise, splitting happens at page boundaries, the maximum burst length supported by the protocol, or user-specified burst length limitations. The source and destination protocols' requirements are considered to guarantee only legal transfers are emitted. In area-constrained designs, the transfer legalizer may be omitted; legal transfers must then be guaranteed in software.
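The splitting rule described above can be summarized in a few lines of code. The sketch below legalizes a 1D transfer against a 4 KiB page boundary and a maximum burst size; both limits are example values, and the real legalizer additionally honors per-protocol and user constraints.

#include <stdint.h>

/* Behavioral sketch of burst legalization: cut a 1D transfer so that no
 * emitted burst crosses a page boundary or exceeds the maximum burst
 * size. PAGE_SIZE and MAX_BURST are example constraints. */
#define PAGE_SIZE 4096u
#define MAX_BURST 256u

static void legalize_1d(uint64_t src, uint64_t dst, uint64_t len,
                        void (*emit)(uint64_t src, uint64_t dst, uint64_t len))
{
    while (len > 0) {
        /* bytes left until either address hits its next page boundary */
        uint64_t src_room = PAGE_SIZE - (src & (PAGE_SIZE - 1));
        uint64_t dst_room = PAGE_SIZE - (dst & (PAGE_SIZE - 1));
        uint64_t chunk = len;
        if (chunk > src_room) chunk = src_room;
        if (chunk > dst_room) chunk = dst_room;
        if (chunk > MAX_BURST) chunk = MAX_BURST;
        emit(src, dst, chunk);      /* one legal burst */
        src += chunk; dst += chunk; len -= chunk;
    }
}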


Transport Layer: The parametric and modular transport layer implements the protocol characteristics for legalized transfers and decouples read and write operations, maximizing the bus utilization of any transfer. It uses read and write managers to handle protocol-specific operations, allowing it to internally operate only on generic byte streams, as shown in Fig. 5. This enables our iDMA to easily support multiple on-chip protocols and multiple ports of the same protocol. The number and type of protocol ports available in the engine must be set at compile time, whereas the protocol port a given transaction uses can be selected at run time through the front-end.

The read and write parts of the transport layer are decoupled from the legalizer by first in, first out (FIFO) buffers, allowing a configurable number of outstanding transfers. A dataflow element decouples the read and write parts, ensuring that only protocol-legal back pressure is applied to the memory system at each end, coalesces transfers, and cuts long timing paths to increase the engine's maximum operating frequency. Fully buffered operation may be required depending on the system and the memory endpoints; in this case, the small FIFO buffer in the dataflow element may be replaced with an SRAM-based buffer, allowing entire transfers to be stored. Two shifters, one at each end of the dataflow element, align the byte stream to bus boundaries. In-stream accelerators, allowing operations to be performed on the data stream during data movement, may be integrated into the dataflow element, augmenting the buffer in the transport layer. Our dataflow-oriented architecture allows us to switch between multiple read managers, write managers, and in-stream accelerators in-cycle, allowing our engine to asymptotically reach perfect bus utilization even when the used protocols or acceleration schemes change regularly.

Protocol Managers: The transport layer abstracts the interfacing on-chip protocols through read and write managers with standardized interfaces, allowing for true multi-protocol capabilities. Read managers receive the current read transfer's base address, transfer length, and protocol-specific configuration information as inputs. They then emit a read-aligned stream of data bytes to the downstream transport layer. Write managers receive the write transfer information and the write-aligned stream of data bytes from the upstream transport layer to be emitted over their on-chip protocol's manager port. Table III provides a complete list of supported protocols. The Init pseudo-protocol only provides a read manager emitting a configurable stream of either the same repeated value, incrementing values, or a pseudorandom sequence. This enables our engine to accelerate memory initialization.

TABLE III
AVAILABLE ON-CHIP PROTOCOLS AND THEIR KEY CHARACTERISTICS. ALL PROTOCOLS SHARE BYTE ADDRESSABILITY AND A READY-VALID HANDSHAKE

Protocol              Version    Request Channel    Response Channel   Bursts
AXI4+ATOP [15]        H.c        AW b, W b, AR a    B b, R a           256 beats or 4 kB c
AXI4 Lite [15]        H.c        AW b, W b, AR a    B b, R a           no
AXI4 Stream [19]      B          T d                T d                unlimited
OpenHW OBI [16]       v1.5.0     D                  R                  no
SiFive TileLink [20]  v1.8.1 e   A                  R                  UH: power of two
Init g h              N.A.       N.A.               N.A.               N.A.

a read  b write  c whichever is reached first  d symmetrical RX/TX channels
e TL-UL & TL-UH supported  f valid is expressed through cyc, ready through ack (and rti)
g memory initialization pseudo-protocol  h read-only supported
edge AI nodes. These cores are connected to an SRAM-based
tightly-coupled data memory (TCDM) with single-cycle ac-
buffered operation may be required depending on the system cess latency, providing the processing cores with fast access to
and the memory endpoints; in this case, the small FIFO buffer shared data. While the TCDM is fast, it is very limited in size;
in the dataflow element may be replaced with an SRAM-based the platform thus features a level-two (L2) on-chip and level-
buffer, allowing entire transfers to be stored. Two shifters, one three (L3) off-chip HyperBus RAM [22]. To allow the cluster
at each end of the dataflow element, align the byte stream to fast access to these larger memories, a DMA unit is embed-
bus boundaries. In-stream accelerators, allowing operations ded, specialized for transferring data from and to the level-one
performed on the data stream during data movement, may be (L1) memory.
integrated into the dataflow element, augmenting the buffer in iDMAE Integration: In the PULP-open system, our iDMAE
the transport layer. Our dataflow-oriented architecture allows is integrated into the processing cluster with a 64-b AXI4 inter-
us to switch between multiple read managers, write managers, face to the host platform and an OBI connection to the TCDM,
and in-stream accelerators in-cycle, allowing our engine to see Fig. 6. The multi-protocol back-end is fed by a tensor_ND
asymptotically reach perfect bus utilization even when the used mid-end, configured to support three dimensions, allowing for
protocols or acceleration schemes change regularly. fast transfer of 3D data structures common in ML workloads.
Protocol Managers: The transport layer abstracts the in- At the same time, higher-dimensional transfers are handled in
terfacing on-chip protocols through read and write managers software. The back- and mid-end are configured through per-
with standardized interfaces, allowing for true multi-protocol core reg_32_3d front-ends and two additional front-ends, allow-
capabilities. Read managers receive the current read transfer’s ing the host processor to configure the iDMAE. Round-robin
base address, transfer length, and protocol-specific configu- arbitration is implemented through a round-robin arbitration
ration information as inputs. They then emit a read-aligned mid-end connecting the front-ends to the tensor_ND mid-end.
stream of data bytes to the downstream transport layer. Write Multiple per-core front-ends ensure atomic DMA access and
managers receive the write transfer information and the write- prevent interference between the cores launching transactions.
aligned stream of data bytes from the upstream transport layer Benchmarks: To evaluate iDMAE performance in a realis-
to be emitted over their on-chip protocol’s manager port. tic application, we use Dory [23] to implement MobileNetV1
Table III provides a complete list of supported protocols. The inference on PULP-open. This workload relies heavily on the
Init pseudo-protocol only provides a read manager emitting a iDMAE to transfer the data for each layer stored in L2 or off-
configurable stream of either the same repeated value, incre- chip in L3 into the cluster’s TCDM in parallel with cluster core
menting values, or a pseudorandom sequence. This enables our computation. 2D, 3D, and very small transfers are frequently
engine to accelerate memory initialization. required for this workload.
In previous versions of PULP-open, MCHAN [11] was used
III. CASE STUDIES to transfer data between the host L2 and the cluster’s TCDM.
To demonstrate the generality and real-world benefits of We assume this as a baseline for our evaluation.
iDMA, we detail its integration into five systems spanning a Results: In PULP-open, iDMAE can almost fully utilize the
wide range of capabilities, from ULP processors for edge AI, bandwidth to the L2 and TCDM in both directions: measuring
to HPC manycore architectures. with the on-board timer, a transfer of 8 KiB from the cluster’s
TCDM to L2 requires 1107 cycles, of which 1024 cycles are
required to transfer the data using a 64-b data bus. The mini-
A. PULP-Open
mal overhead is caused by configuration, system latency, and
PULP-open is a ULP edge compute platform consisting contention with other ongoing memory accesses. During Mo-
of a 32-b RISC-V microcontroller host and a parallel com- bileNetV1 inference, individual cores frequently require short
pute cluster [21]. The compute cluster comprises eight 32-b transfers, incurring a potentially high configuration overhead.
RISC-V cores with custom instruction set architecture (ISA) With its improved tensor_3D mid-end, iDMA improves the


Furthermore, configured with similar queue depths as MCHAN, the iDMAE with its reg_32_3d front-ends achieves a 10 % reduction in the utilized area within a PULP cluster.
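For reference, the near-full bus utilization reported above follows directly from the transfer size and the bus width; this is a back-of-the-envelope check, not an additional measurement:

\[
\frac{8\,\mathrm{KiB}}{8\,\mathrm{B/cycle}} = 1024\ \text{cycles (ideal)}, \qquad
\frac{1024}{1107} \approx 92.5\,\%\ \text{bus utilization},
\]

i.e., only 83 cycles of overhead for configuration, system latency, and contention.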
B. ControlPULP

ControlPULP [12] is an on-chip parallel microcontroller unit (MCU) employed as a power controller system (PCS) for manycore HPC processors. It comprises a single 32-b RISC-V manager domain with 512 KiB of L2 scratchpad memory and a programmable accelerator (cluster domain) hosting eight 32-b RISC-V cores and 128 KiB of TCDM.

A power control firmware (PCF) running on FreeRTOS implements a reactive power management policy. ControlPULP receives (i) dynamic voltage and frequency scaling (DVFS) directives such as frequency targets and power budgets from high-level controllers and (ii) temperature readings from process-voltage-temperature (PVT) sensors and power consumption from voltage regulator modules (VRMs), and is tasked to meet its constraints. The PCF consists of two periodic tasks, a periodic frequency control task (PFCT) (low priority) and a periodic voltage control task (PVCT) (high priority), that handle the power management policy.

ControlPULP requires an efficient scheme to collect sensor data at each periodic step without adding overhead to the computation part of the power management algorithm.

iDMAE Integration: As presented by Ottaviano et al. [12], the manager domain offloads the computation of the control action to the cluster domain, which independently collects the relevant data from PVT sensors and VRMs. We redesign ControlPULP's data movement paradigm by integrating a second dedicated iDMAE, called sensor DMAE (sDMAE), in the manager domain to simplify the programming model and redirect non-computational, high-latency data movement functions to the manager domain, similar to IBM's Pstate and Stop engines [24]. Our sDMAE is enhanced with rt_3D, a mid-end capable of autonomously launching repeated 3D transactions. ControlPULP's architecture is heavily inspired by PULP-open; the iDMAE integration can thus be seen in Fig. 6. The goal of the extension is to further reduce software overhead for the data movement phase, which is beneficial to the controller's slack within the control hyperperiod [25]. The sDMAE supports several interface protocols, thus allowing the same underlying hardware to handle multiple scenarios.

Benchmarks and results: We evaluate the performance of the enhanced sDMAE by executing the PCF on top of FreeRTOS within an FPGA-based (Xilinx Zynq UltraScale+) hardware-in-the-loop (HIL) framework that couples the programmable logic implementing the PCS with a power, thermal, and performance model of the plant running on top of the ARM-based processing system [12].

Data movement handled by rt_3D, which allows repeated 3D transactions to be launched, brings several benefits to the application scenario under analysis. First, it decouples the main core in the processing domain from the sDMAE in the I/O domain. The sDMAE autonomously realizes periodic external data accesses in hardware, minimizing the context switching and response latency suffered by the manager core in a pure software-centric approach. We consider a PFCT running at 500 µs and the PVCT at 50 µs, meaning at least ten task preemptions during one PFCT step with the FreeRTOS preemptive scheduling policy. The measured task context switch time in FreeRTOS for ControlPULP is about 120 clock cycles [12], while the iDMAE programming overhead for reading and applying the computed voltages is about 100 clock cycles. From FPGA profiling runs, we find that the use of the sDMAE saves about 2200 execution cycles every scheduling period, thus increasing the slack of the PVCT task. Autonomous and intelligent data access from the I/O domain is beneficial as it allows the two subsystems to reside in independent power and clock domains that could be put to sleep and woken up when needed, reducing the uncore domain's power consumption.

Our changes add minimal area overhead to the system. In the case of eight events and sixteen outstanding transactions, the sDMAE is about 11 kGE in size, accounting for an area increase of only 0.001 % relative to the original ControlPULP area. The overhead imposed by the sDMAE is negligible when ControlPULP is used as an on-chip power manager for a large HPC processor. It has been shown [12] that the entire ControlPULP occupies only a small area of around 0.1 % on a modern HPC CPU die.
C. Cheshire

Cheshire [10] is a minimal, technology-independent, 64-b Linux-capable SoC based around CVA6 [14]. In its default configuration, Cheshire features a single CVA6 core, but coherent multicore configurations are possible.

iDMAE Integration: As Cheshire may be configured with multiple cores running different operating systems or SMP Linux, we connect our iDMAE to the SoC using desc_64. Descriptors are placed in scratchpad memory (SPM) by a core and are then, on launch, fetched and executed by our iDMAE. This single-write launch ensures atomic operation in multi-hart environments. Support for descriptor chaining enables arbitrarily shaped transfers. Furthermore, transfer descriptors allow for loose coupling between the PEs and the DMAE, enabling our engine to hide the memory endpoint's latency and freeing the PEs up to do useful work.

The used back-end is configured to a data and address width of 64 b and can track eight outstanding transactions, enough to support efficient fine-grained accesses to the SPM and external memory IPs. A schematic view of Cheshire and the iDMAE configuration can be found in Fig. 7.
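The descriptor-based flow lends itself to a short illustration. The sketch below builds a two-descriptor chain in SPM and hands its address to the front-end with a single store; the structure layout, the next-descriptor field, and the doorbell register address are hypothetical stand-ins, not the actual desc_64 format and register map.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical in-memory layout of a desc_64 transfer descriptor.
 * Field names and the chaining convention are illustrative only. */
typedef struct idma_desc {
    uint64_t src_address;
    uint64_t dst_address;
    uint64_t transfer_length;
    uint64_t backend_configuration;
    struct idma_desc *next;        /* NULL terminates the chain */
} idma_desc_t;

/* Hypothetical doorbell register of the desc_64 front-end. */
#define IDMA_DESC_DOORBELL (*(volatile uint64_t *)0x01000000u)

/* Launch a chained copy of two disjoint blocks with a single write:
 * the front-end fetches and executes the descriptors autonomously. */
static void launch_chained_copy(idma_desc_t spm_descs[2],
                                uint64_t dst0, uint64_t src0, uint64_t len0,
                                uint64_t dst1, uint64_t src1, uint64_t len1)
{
    spm_descs[0] = (idma_desc_t){ src0, dst0, len0, 0, &spm_descs[1] };
    spm_descs[1] = (idma_desc_t){ src1, dst1, len1, 0, NULL };
    /* A single store publishes the chain head, keeping the launch atomic
     * across harts. */
    IDMA_DESC_DOORBELL = (uint64_t)(uintptr_t)&spm_descs[0];
}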
Benchmarks: We use the AXI DMA v7.1 [26] from Xilinx integrated into the Cheshire SoC as a comparison. We run synthetic workloads copying data elements of varying lengths, allowing a more direct comparison of the bus utilization at a given transfer length.

Results: Compared to the AXI DMA v7.1, our iDMAE increases bus utilization by almost 6× when launching fine-grained 64-B transfers.


Fig. 7. (Top) Block diagram of the Cheshire SoC. (Bottom) Configuration of the system's AXI iDMAE.

Fig. 8. Bus utilization for a given transfer length. The performance of our iDMAE is compared to Xilinx's AXI DMA v7.1 [26]. The dotted line represents the theoretical limit physically possible.

At this granularity, the iDMAE achieves almost perfect utilization, as shown in Fig. 8. We implemented both designs on a Digilent Genesys II FPGA, reducing the required LUTs by over 10 % and FFs by over 23 %. As our iDMAE does not require SRAM buffers, we can reduce the amount of BRAM used from 216 Kib to zero.

D. MemPool

MemPool [13] is a flexible and scalable single-cluster manycore architecture featuring 256 32-b RISC-V cores that share 1 MiB of low-latency L1 SPM distributed over 1024 banks. All cores are individually programmable, making MemPool well-suited for massively parallel regular workloads like computational photography or machine learning as well as irregular workloads like graph processing. The large shared L1 memory simplifies the programming model, as all cores can directly communicate via shared memory without explicit dataflow management. The L1 banks are connected to the cores via a pipelined, hierarchical interconnect. Cores can access banks close to them within a single cycle, while banks further away have a latency of three or five cycles. In addition to the L1 interconnect, the cores have access to a hierarchical AXI [15] interconnect connecting to the SoC.

iDMAE Integration: MemPool's large scale and distributed L1 memory make a monolithic DMAE incredibly expensive, as it would require a dedicated interconnect, spanning the whole MemPool architecture, connecting all 1024 memory banks. The existing interconnect between cores and L1 memory is built for narrow, single-word accesses and is thus unsuitable for wide, burst-based transfers.

To minimize interconnect overhead to L1 memory, multiple back-ends are introduced into MemPool, each placed close to a group of banks. To connect to the SoC, they can share the existing AXI interconnect used to fetch instructions.

These distributed iDMAE back-ends, each controlling exclusive regions of the L1 memory, greatly facilitate physical implementation. However, individually controlling all back-ends would burden the programmer and massively increase overhead due to transfer synchronization of the individual DMAEs. Instead, our iDMAE's modular design allows for hiding this complexity in hardware by using a single front-end to program all the distributed back-ends.

Fig. 9. Our distributed iDMAE implemented in MemPool. Mp_split splits the transfers along their L1 boundaries, and a tree of mp_dist mid-ends distributes the transfers.

As seen in Fig. 9, the mp_split mid-end splits a single DMA request along MemPool's L1 memory address boundaries, and a binary tree of mp_dist mid-ends distributes the resulting requests to all back-ends.
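To illustrate how mp_split and mp_dist cooperate, the sketch below splits a linear request at a parametric L1 region boundary and steers each piece to a back-end by its address. The region size, the number of back-ends, and the routing rule are illustrative assumptions, not MemPool's exact configuration.

#include <stdint.h>

/* Illustrative mp_split + mp_dist behavior: cut a linear transfer at
 * REGION_SIZE boundaries (so no piece crosses a back-end's exclusive L1
 * region) and route each piece to the back-end owning its destination.
 * REGION_SIZE and NUM_BACKENDS are example parameters. */
#define REGION_SIZE  (64u * 1024u)
#define NUM_BACKENDS 16u

static void split_and_distribute(uint64_t src, uint64_t dst, uint64_t len,
                                 void (*to_backend)(unsigned idx, uint64_t src,
                                                    uint64_t dst, uint64_t len))
{
    while (len > 0) {
        /* largest chunk that stays inside the destination's L1 region */
        uint64_t room  = REGION_SIZE - (dst % REGION_SIZE);
        uint64_t chunk = (len < room) ? len : room;
        unsigned idx   = (unsigned)((dst / REGION_SIZE) % NUM_BACKENDS);
        to_backend(idx, src, dst, chunk);
        src += chunk; dst += chunk; len -= chunk;
    }
}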
Benchmarks: We evaluate MemPool's iDMAE by comparing the performance of various kernels against a baseline without a DMA. Since MemPool requires our modular iDMAE to implement a distributed DMA, a comparison with another DMA unit is not feasible here. First, we compare the performance of copying 512 KiB from L2 to L1 memory. Without a DMA, the cores can only utilize one sixteenth of the wide AXI interconnect. The iDMAE utilizes 99 % of it and speeds up memory transfers by a factor of 15.8× while incurring an area overhead of less than 1 %.

The performance improvement for kernels is evaluated by comparing a double-buffered implementation supported by our iDMAE to the cores copying data in and out before and after the computation. Even for heavily compute-bound kernels like matrix multiplication, the iDMAE provides a speedup of 1.4×. Less compute-intensive kernels like the convolution or the discrete cosine transformation benefit even more from the iDMAE, with speedups of 9.5× and 7.2×, respectively. Finally, memory-bound kernels like vector addition and the dot product are dominated by the data transfers and reach speedups of 15.7× and 15.8×.

E. Manticore-0432x2

Manticore-0432x2³ is a high-performance compute platform based on the Manticore [14] 2.5D chiplet concept. It provides two Linux-capable CVA6 [14] host cores and 432 Snitch [18] worker cores grouped in 48 compute clusters and sharing 16 GiB of high-bandwidth memory (HBM) across two dies. It enhances its RISC-V ISA with lightweight extensions to maximize its floating-point unit (FPU) utilization in both regular and irregular workloads. Manticore-0432x2's use cases range from DNN training and inference to stencil codes and sparse scientific computing.

³ Manticore-0432x2 is adapted from the original Manticore architecture [14]; in Manticore-AAAAxB, AAAA represents the total number of PEs and B the number of chiplets in the system.


Fig. 10. (Top) Block diagram of one Manticore-0432x2 die. (Bottom) Configuration of the cluster DMA.

iDMAE Integration: Each Snitch cluster has an iDMAE, called the cluster DMA, fetching data directly from HBM or sharing data between clusters; a complex hierarchical interconnect is required to support both efficiently. The cluster DMA present in the SoC is used only to manage operations and initialize memory.

Each iDMAE is tightly coupled to a data movement core, as shown in Fig. 10. The Snitch core decodes the iDMA instructions and passes them to an inst_64 front-end. A tensor_ND mid-end enables efficient 2D affine transfers. Higher-dimensional or irregular transfers can be handled efficiently through fine-granular control code on the core: configuring and launching a 1D transfer incurs only three instructions, while 2D transfers require at most six instructions to be launched.

The cluster DMA is configured with a data width of 512 b and an address width of 48 b. It can track 32 outstanding transactions, enabling efficient transfers and latency hiding on fine-granular accesses to the long-latency HBM endpoint, even in the face of congestion from other clusters. It provides one AXI4 read-write port connecting to the surrounding SoC and one OBI read-write port connecting to the cluster's banked L1 memory. AXI intra-protocol transfers are enabled to reorganize data within HBM or L1.
Benchmarks: We evaluate general matrix multiplication (GEMM), sparse matrix-vector multiplication (SpMV), and sparse matrix-matrix multiplication (SpMM) on Manticore-0432x2 with and without the use of cluster DMA engines. We run RTL simulation on clusters processing double-precision tiles and use these results to compute the performance of a single chiplet, taking into account bandwidth bottlenecks and assuming all reused data is ideally cached.

Each workload is evaluated with four cluster tile sizes, S, M, L, and XL; for GEMM, these are square tiles of size 24, 32, 48, and 64, while the two sparse workloads use the matrices of increasing density diag, cz2548, bcsstk13, and raefsky1 from the SuiteSparse matrix collection [27] as tiles.

As Snitch [18] originally does not include a DMAE, we compare Manticore-0432x2 to an architecture where worker cores make all data requests with an ideal capability to handle outstanding transactions, but real bandwidth limitations.

Fig. 11. Manticore-0432x2 chiplet bandwidths and speedups enabled by iDMA on workloads with varying tile sizes.

Results: Fig. 11 shows our results. For the highly compute-bound GEMM, our iDMAEs enable moderate but significant speedups of 1.37× to 1.52×. Since all tile sizes enable ample cluster-internal data reuse, we see only small benefits as tiles grow. Nevertheless, the cluster engines still increase peak HBM read bandwidth from 17 to 26 GB/s.

SpMV performance, on the other hand, is notoriously data-dependent and memory-bound due to a lack of data reuse. Unable to leverage on-chip caches fed by a wider network, the baseline nearly saturates its narrow interconnect for all tile sizes at 48 GB/s. The iDMAEs only become memory-bound past M-sized tiles, but then approach the wide interconnect peak throughput of 384 GB/s. Overall, the engines enable significant speedups of 5.9× to 8.4×.

SpMM is similar to SpMV but enables on-chip matrix data reuse, becoming compute-bound for both the baseline and the iDMAEs. Since data caching is now beneficial, the baseline overcomes the 48 GB/s bottleneck, and speedups decrease to between 2.9× and 4.9×. Still, iDMAEs unlock the full compute of clusters on sparse workloads while approaching the 384 GB/s peak throughput only for XL tiles.

F. iDMA Integration and Customization

The high degree of parameterization of iDMA might impose a steep learning curve on any new designer implementing iDMA in a system. To ease the process, we provide wrapper modules, which abstract internal, less-critical parameters away and expose only a critical selection of parameters to the user: the address width (AW), the data width (DW), and the number of outstanding transactions (NAx). Section IV discusses the influence of these parameters on area, timing, latency, and performance.

Customizing iDMA to a given target system involves two main steps: analyzing the system's transfer pattern and either selecting the required mid-end(s) or creating a new one, and selecting an appropriate parametrization. The NAx parameter should be selected high enough to saturate the memory system when launching the finest-granular transfers while not overwhelming the downstream targets.
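As a rule of thumb (our own approximation, not a formula from the paper), NAx should cover the endpoint's latency-bandwidth product at the smallest transfer size:

\[
\mathrm{NAx} \;\gtrsim\; \left\lceil \frac{t_{\text{latency}}}{t_{\text{transfer}}} \right\rceil
\;=\; \left\lceil \frac{L \cdot \mathrm{DW}}{8 \cdot S_{\min}} \right\rceil ,
\]

where L is the endpoint latency in cycles, DW the data width in bits, and S_min the smallest transfer size in bytes. For example, for the HBM system of Section IV-D (L ≈ 100, DW = 32 b, S_min = 16 B), this suggests on the order of 25 outstanding transactions.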
IV. ARCHITECTURE RESULTS

To deepen the insight into iDMA and highlight its versatility, we provide IP-level implementation results in this section. We first present area and timing models characterizing the influence of parametrization on our architecture, enabling quick and accurate estimations when integrating engines into new systems. We then use these models to show that iDMA's area and timing scale well for any reasonable parameterization. Finally, we present latency results for our back-end and discuss our engine's performance in three sample memory systems. For implementation experiments, we use GlobalFoundries' GF12LP+ technology with a 13-metal stack and a 7.5-track standard cell library in the typical process corner. We synthesize our designs using Synopsys Design Compiler 2022.12 in topological mode to account for place-and-route constraints, congestion, and physical phenomena.


TABLE IV
AREA DECOMPOSITION OF THE DMAE CONFIGURATION USED IN THE PULP CLUSTER, SEE SECTION III-A. THE BASE AREA IS ALWAYS REQUIRED; THE CONTRIBUTION OF EACH ADDED PROTOCOL PORT IS SHOWN. IF AN AREA CONTRIBUTION IS NON-ZERO, THE PARAMETER INFLUENCING ITS VALUE IS GIVEN IN BIG-O NOTATION. AREA CONTRIBUTIONS SCALE LINEARLY WITH THE DATA WIDTH (DW) IF NO SCALING IS PROVIDED

Back-end unit (group)                   Base                Per-port contribution
Decoupling                              3.7 kGE a, O(NAx)   AXI 1.4 kGE; AXI Lite, AXI Stream, OBI, TileLink 310 GE each; Init 0 (each O(NAx))
State (legalizer)                       1.5 kGE b, O(AW)    AXI 710 GE c; AXI Lite 200 GE c; AXI Stream 180 GE c; OBI 180 GE c; TileLink 215 GE c; Init 21 GE (each O(AW))
Page split (legalizer)                  0                   95 GE, 105 GE, 7 GE, 8 GE, 5 GE, and 5 GE across protocol ports (each O(1)); 0 for the remaining ports
Power-of-two split (legalizer)          0                   20 GE each for two of the ports (O(1)); 0 elsewhere
Dataflow element (transport layer)      1.3 kGE d           0
Read/write managers (transport layer)   -                   30 GE to 230 GE per read or write manager, depending on protocol and direction
Shifter/muxing (transport layer)        120 GE              AXI 250 GE c; AXI Lite 75 GE c; AXI Stream 180 GE c; OBI 170 GE c; TileLink 65 GE c; Init 0

a NAx: number of outstanding transfers supported. Used configuration: 16.
b AW: address width. Used configuration: 32 b.
c If multiple protocols are used, only the maximum is taken.
d DW: data width. Used configuration: 32 b.

A. Area Model

We focus our evaluation and modeling effort on our iDMA back-end, as both the front-end and mid-end are very application- and platform-specific and can only be properly evaluated in-system. Our area model was evaluated at 1 GHz using the typical process corner in GF12LP+.

For each of the back-end's major area contributors listed in Table IV, we fit a set of linear models using non-negative least squares. For each parametrization, our models take a vector containing the number of ports of each protocol as an input. This set of models allows us to estimate the area decomposition of the base hardware of the back-end and the area contributions of any additional protocol port, given a particular parameterization, with an average error of less than 4 %. For example, Table IV shows the modeled area decomposition for our base configuration of 32-b address width, 32-b data width, and two outstanding transfers.

A second step is required to estimate area contributions to the back-end depending on the parameterization, the number, and the type of ports. We created a second parametric model estimating the influence of the three main parameters, address width (AW), data width (DW), and the number of outstanding transactions (NAx), on the back-end's area contributions. We can estimate the area composition of the back-end with an average error of less than 9 %, given both the parameterization and the used read/write protocol ports as input.
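The combined model can be summarized as follows; this is our paraphrase of the two fitting steps described above, with symbols chosen for illustration rather than taken from the paper:

\[
A_{\text{backend}}(\mathbf{n}, \mathrm{AW}, \mathrm{DW}, \mathrm{NAx})
\;\approx\; A_{\text{base}}(\mathrm{AW}, \mathrm{DW}, \mathrm{NAx})
\;+\; \sum_{p \in \text{protocols}} n_p \cdot A_p(\mathrm{AW}, \mathrm{DW}, \mathrm{NAx}),
\]

where n_p is the number of ports of protocol p, and the per-unit terms A_base and A_p are linear in the parameter indicated in Table IV (e.g., O(NAx) for the decoupling buffers, O(AW) for the legalizer state, and O(DW) for the dataflow element), fitted with non-negative least squares.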
We provide a qualitative understanding of the influence of parameterization on area by listing the parameter with the strongest correlation using big-O notation in Table IV.

To outline the accuracy of our modeling approach, we show the area scaling of four of our iDMAEs for different protocol configurations, depending on the three main parameters, starting from the base configuration. The subplots of Fig. 12 present the change in area when one of the three main parameters is modified, together with the output of our two linear models combined. The combined area model tracks the parameter-dependent area development with an average error of less than 9 %. In those cases where the model deviates, the modeled area is overestimated, providing a safe upper bound for the back-end area.

Fig. 12. Area scaling of a back-end base configuration (32-b address and data width, two outstanding transactions). Markers represent the measurement points and lines the fitted model.

B. Timing Model

We again focus our timing analysis on the back-end, as the front-end should be analyzed in-system and mid-ends may be isolated from the iDMAE's timing by cutting the timing paths between front-, mid-, and back-ends. Our investigation shows a multiplicative-inverse dependency between the longest path in ns and our main parameters. We use the base configuration of the back-end to evaluate our timing model by sweeping our three main parameters. The tracking of our model is presented in Fig. 13 for six representative configurations, ranging from simple OBI to complex multi-protocol configurations involving AXI. Our timing model achieves an average error of less than 4 %.

The results divide our back-ends into two groups: simpler protocols, OBI and AXI Lite, run faster as they require less complex legalization logic, whereas more complex protocols require deeper logic and thus run slower. Engines supporting multiple protocols and ports also run slower due to additional arbitration logic in their data path. Data width has a powerful impact on the iDMAE's speed, mainly due to the wider shifters required to align the data. The additional slowdown at larger data widths can be explained by physical routing and placement congestion of the increasingly large buffer in the dataflow element. Address width has little effect on the critical path, as the critical path does not pass through the legalizer cores, whose timing is most notably affected by address width. Increasing the number of outstanding transactions sub-linearly degrades timing due to the more complex FIFO management logic required to orchestrate them.

Fig. 13. Clock frequency scaling of a back-end base configuration (32-b address and data width, two outstanding transactions).


C. Latency
Our iDMA back-ends have a fixed latency of two cycles
from receiving a 1D transfer from the front-end or the last
mid-end to the read request at a protocol port. Notably, this is
independent of the protocol selection, the number of protocol
ports, and the three main iDMA parameters. This rule only has
one exception: in a back-end without hardware legalization sup-
port, the latency reduces to one cycle. Generally, each mid-end
presented in Table II requires one additional cycle of latency.
We note, however, that the tensor_ND mid-end can be config-
ured to have zero cycles of latency, meaning that even for an
N-dimensional transfer, we can ensure that the first read request
is issued two cycles after the transfer arrives at the mid-end from
the front-end.
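Summarizing the rules above as a formula (our notation, derived directly from the statements in this subsection):

\[
t_{\text{first read}} \;=\; 2 \;+\; N_{\text{mid-end}}\ \text{cycles},
\]

with the two stated exceptions: a back-end without hardware legalization needs only one cycle, and a tensor_ND mid-end configured for zero-latency operation contributes no additional cycle, so even an N-dimensional transfer can issue its first read two cycles after reaching the mid-end.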

D. Standalone Performance

We evaluated the out-of-context performance of an iDMAE in the base configuration copying a 64 KiB transfer fragmented into individual transfer sizes between 1 B and 1 KiB in three different memory system models. The analysis is protocol-agnostic, as all implemented protocols support a similar outstanding-transaction mechanism; we thus use AXI4 in this subsection without loss of generality. The three memory systems used in our evaluation differ in access cycle latency and in the number of outstanding transfers.

SRAM represents the L2 memory found in the PULP-open system (Section III-A), with three cycles of latency and eight outstanding transfers. RPC-DRAM uses the characteristics of an open-source AXI4 controller for the RPC DRAM technology [28], [29] run at 933 MHz, with around thirteen cycles of latency and support for sixteen outstanding transactions. HBM models an industry-grade HBM [14] interface with a latency on the order of 100 cycles and support for tracking more than 64 outstanding transfers.

In shallow memory systems, the iDMAE reaches almost perfect bus utilization copying single bus-sized data transfers while being required to track as few as eight outstanding transactions. More outstanding requests are required in deeper memory systems to sustain this perfect utilization. Any transfer smaller than the bus width will inevitably lead to a substantial drop in utilization, meaning that unaligned transfers inherently limit the maximum possible bus utilization our engines can achieve. Nevertheless, our fully decoupled, data-flow-oriented architecture maximizes the utilization of the bus even in these scenarios.

Fig. 14 shows that even in very deep systems with hundreds of cycles of latency, our engine can achieve almost perfect utilization for a relatively small transfer granularity of four times the bus width. This agility in handling transfers allows us to copy multi-dimensional tensors with a narrow inner dimension efficiently.

The cost of supporting such fine-granular transfers is an increased architectural size of the engine's decoupling buffers. As shown in Fig. 12(c), these scale linearly in the number of outstanding transactions to be supported, growing by roughly 400 GE for each added buffer stage. In our base configuration, supporting 32 outstanding transfers keeps the engine area below 25 kGE.


Fig. 14. Bus utilization of our iDMAE in the base configuration (32-b address and data width) with varying amounts of outstanding transactions in three different memory systems: SRAM, RPC DRAM [28], and HBM [14].

E. Energy Efficiency

With iDMA's area-optimized design, the capability of selecting the simplest on-chip protocol, and the effort to minimize buffer area, we can curtail the area footprint and thus the static power consumption of our iDMAEs.
iDMA's decoupled, agile, data-flow-oriented architecture is explicitly designed to handle transfers efficiently while maximizing bus utilization, limiting the unit's active time to a minimum. Coupled with our minimal area footprint and low-buffer design, this directly minimizes energy consumption. Furthermore, the engine's efficiency allows run-to-completion operating modes in which we maximize the interconnect's periods of inactivity between transfers, allowing efficient clock gating of the iDMAE and the interconnect and further increasing energy efficiency.

V. RELATED WORK

We compare iDMA to an extensive selection of commercial DMA solutions and DMAEs used in research platforms; an overview is shown in Table V.
In contrast to this work, existing DMAEs are designed for a given system, a family of systems, or even a specific application on a system. These engines lack modularity and cannot readily be retargeted to a different system. To the best of our knowledge, our work is the first fully modular and universal DMAE architecture. Moreover, most of the DMAEs in our comparison are closed-source designs and thus not accessible to the research community, hindering or even preventing benchmarking and quantitative comparisons.
We identify two general categories of DMAEs: large high-bandwidth engines specialized in efficient memory transfers and low-footprint engines designed for accessing peripherals efficiently. Ma et al. [1], Paraskevas et al. [31], and Rossi et al. [11] present high-performance DMAEs ranging in size from 82 kGE to over 1.5 MGE. On the contrary, DMAEs designed for accelerating accesses to chip peripherals, as shown in the works of Pullini et al. [38] and Morales et al. [39], trade off performance for area efficiency by minimizing buffer space [38] and supporting only simpler on-chip protocols like AHB [39] or OBI [38]. Our iDMA can be parameterized to achieve peak performance as a high-bandwidth engine in HPC systems, see Section III-E, as well as to require less area (<2 kGE) than the ultra-lightweight design of Pullini et al.'s µDMA [38].
DMAEs can be grouped according to their system binding: register-, transfer-descriptor-, and instruction-based. Engines requiring a high degree of agility [11], [30] or featuring a small footprint [33], [38], [39] tend to use a register-based interface. PEs write the transfer information into a dedicated register space and use a read or write operation to a special register location to launch the transfer. In more memory-compute-decoupled systems [31] or manycore environments [1], [31], transfer descriptors prevail. In some MCU platforms [34], [36], DMAEs are programmed using a custom instruction stream. Generally, DMAEs only feature one programming interface, with some exceptions: both Xilinx's AXI DMA v7.1 and Synopsys' DW_axi_dmac support, next to their primary transfer-descriptor-based interface, a register-based interface usable with only a reduced subset of the engines' features [17], [26]. Our flexible architecture allows us to use these three system bindings without limiting iDMA's capabilities. With our standardized interfaces, any custom binding can be implemented, fully tailoring the engine to the system it is attached to. Compared to the prevailing approach of using custom ISAs [34], [36], our inst_64 front-end extends the RISC-V ISA, allowing extremely agile programming of complex transfer patterns.
SoC DMAEs, e.g., Fjeldtvedt et al. [30], Rossi et al. [11], and Morales et al. [39], feature a fixed configuration of on-chip protocol(s). Some controllers allow selectively adding a simple interface to connect to peripherals; the DMA IPs from Synopsys [17] and ARM [34] are two examples. We identify one exception to this rule: FastVDMA from Antmicro [33] can be configured to select one read-only and one write-only port from a selection of three protocols. FastVDMA only supports unidirectional data flow from the read port to the write port. Intra-port operation, meaning copying data from one port and storing it through the same port, is not supported, which can limit its usability. iDMA allows the selective addition of one or multiple read or write interface ports from a list of currently five industry-standard on-chip protocols. If configured, our engines allow bidirectional data movement in inter-port and intra-port operations. Thanks to the standardization of interfaces and the separation of data movement from protocol handling, new on-chip protocols can be added quickly by implementing at most three modules, each only a couple of hundred GE in complexity.
Many DMAEs support transfers with more than one addressing dimension. Two-dimensional transfers are commonly accelerated in hardware [1], [26], [34]. Fjeldtvedt et al.'s CubeDMA can even handle three-dimensional transfers. Any higher-dimensional transfer is handled in software, either by repetitively launching simpler transfers [11], [30] or by employing transfer descriptor chaining [1], [26].
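To illustrate the register-based system binding discussed above (transfer fields written to a dedicated register space, followed by an access to a special launch location), the following C sketch programs a linear copy through memory-mapped registers. The base address, register offsets, and status semantics are hypothetical placeholders and do not reproduce iDMA's or any cited engine's actual register map.

#include <stdint.h>

/* Hypothetical MMIO register map of a register-programmed DMA engine.
 * Base address, offsets, and status semantics are illustrative only. */
#define DMA_BASE      0x40001000UL
#define REG64(off)    (*(volatile uint64_t *)(DMA_BASE + (off)))
#define DMA_SRC_ADDR  REG64(0x00)
#define DMA_DST_ADDR  REG64(0x08)
#define DMA_NUM_BYTES REG64(0x10)
#define DMA_LAUNCH    REG64(0x18)  /* writing here starts the transfer */
#define DMA_BUSY      REG64(0x20)  /* non-zero while a transfer is in flight */

/* Program a 1D copy and start it; the PE is then free to do useful compute. */
static inline void dma_memcpy_1d(uint64_t dst, uint64_t src, uint64_t len)
{
    DMA_SRC_ADDR  = src;
    DMA_DST_ADDR  = dst;
    DMA_NUM_BYTES = len;
    DMA_LAUNCH    = 1;   /* access to the special launch location triggers the engine */
}

static inline void dma_wait_idle(void)
{
    while (DMA_BUSY)
        ;                /* busy-wait; an interrupt-driven completion is equally possible */
}

A transfer-descriptor binding replaces these per-field register writes with a single pointer to an in-memory descriptor, and an instruction-based binding folds the same information into a custom instruction stream.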

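The comparison above also notes that transfers beyond the hardware-supported dimensionality are handled in software, either by repeatedly launching simpler transfers or by chaining transfer descriptors. The sketch below shows both fallbacks for a hypothetical 2D-capable engine; the descriptor layout and the blocking driver stand-in are assumptions for illustration and do not correspond to the desc_64 front-end's actual descriptor format.

#include <stdint.h>
#include <string.h>

/* Hypothetical chained transfer descriptor: one 2D (strided) copy per node.
 * Field names and widths are illustrative, not the desc_64 format. */
typedef struct dma_desc {
    uint64_t src, dst;          /* base addresses                         */
    uint64_t inner_bytes;       /* contiguous bytes per row               */
    uint64_t src_stride;        /* byte distance between consecutive rows */
    uint64_t dst_stride;
    uint64_t num_rows;          /* rows in this 2D slice                  */
    struct dma_desc *next;      /* NULL terminates the chain              */
} dma_desc_t;

/* Software stand-in for handing one 2D descriptor to the engine and blocking. */
static void dma_run_2d_blocking(const dma_desc_t *d)
{
    for (uint64_t r = 0; r < d->num_rows; r++)
        memcpy((void *)(uintptr_t)(d->dst + r * d->dst_stride),
               (const void *)(uintptr_t)(d->src + r * d->src_stride),
               (size_t)d->inner_bytes);
}

/* Software fallback for a 3D transfer on a 2D-capable engine:
 * the outermost dimension is unrolled into repeated 2D launches. */
void dma_copy_3d_sw(uint64_t dst, uint64_t src,
                    uint64_t inner_bytes, uint64_t rows, uint64_t planes,
                    uint64_t src_row_stride, uint64_t dst_row_stride,
                    uint64_t src_plane_stride, uint64_t dst_plane_stride)
{
    for (uint64_t p = 0; p < planes; p++) {
        dma_desc_t d = {
            .src = src + p * src_plane_stride,
            .dst = dst + p * dst_plane_stride,
            .inner_bytes = inner_bytes,
            .src_stride = src_row_stride,
            .dst_stride = dst_row_stride,
            .num_rows = rows,
            .next = NULL,
        };
        dma_run_2d_blocking(&d);   /* one engine launch per plane */
    }
}

An engine with hardware descriptor chaining would instead walk such a linked list autonomously, and an ND-capable mid-end such as tensor_ND removes the software loop entirely by accepting the additional strides and repetition counts directly.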

TABLE V
COMPARISON OF iDMA TO THE SOA

DMAE | Application | Technology | Supported Protocols | Transfer Type | Programming Model | Stream Modification Capability | Modularity / Configurability | Area
CubeDMA [30], Fjeldtvedt et al. | Hyperspectral Imaging | FPGA | AXI (a), AXI4 Stream (b) | 3D | Register File | None | Limited Conf. | 2162 LUTs, 1796 FFs
RDMA [31], Paraskevas et al. | HPC | FPGA | AXI4 | Linear | Transfer Descriptors | None | None | N.A.
cDMA [32] (c), Rhu et al. | DNN | ASIC | N.A. | Linear | None | Compression & Decompression | No | 0.21 mm2 (d), 420 kGE (e)
Rossi et al. [11] | ULP | Tech.-Indep. | STBus (f), OBI (f)(g) | Linear | Per-PE Register File | None | Limited Configurability | ≈ 0.04 mm2, ≈ 82 kGE
MT-DMA [1], Ma et al. | Scientific Computing | ASIC | Custom | 2D, Arb. Strides | Transfer Descriptors | Block Transp. | None | 1.07 mm2, 1.5 MGE (h)
FastVDMA [33], Antmicro | General-purpose | Tech.-Indep. | AXI4 (f)(i), AXI4-Stream (f)(i), Wishbone (f)(i) | Linear | Register File | None | Protocol Selectable (j) | 455 Slices
DMA-330 [34], ARM | General-purpose | Tech.-Indep. | AXI3 (k), Peripheral Intf. (l) | 2D, Scatter-Gather | Custom Instructions | None | Yes, But Non-modular | N.A.
AXI DMA v7.1 [26], Xilinx | General-purpose | FPGA, Xilinx-only | AXI4 (f), optional AXI4-Stream (f) | 2D, Scatter-Gather | Transfer Descriptors | None | Yes, But Non-modular | 2745 LUTs (m), 4738 FFs (m), 216 kb BRAM (m)
vDMA AXI [35], RAMBUS | General-purpose | Tech.-Indep. | AXI3/4 | 2D, Scatter-Gather | Transfer Descriptors | None | Yes, But Non-modular | N.A.
DW_axi_dmac [17], Synopsys | General-purpose | Tech.-Indep. | AXI3/4, Peripheral Intf. (l) | 2D, Scatter-Gather | Transfer Descriptors | None | Yes, But Non-modular | N.A.
Dmaengine [36], STMicroelectronics | MCU | STM32 (n) | STBus (n), Peripheral Intf. | Linear | Custom Instructions | None | No | N.A.
DDMA [37], DCD-SEMI | MCU | Tech.-Indep. (n) | Custom 32-bit (o) | Linear, fixed | Register File | None | No | N.A.
µDMA [38], Pullini et al. | MCU | Tech.-Indep. | OBI (f)(g), RX/TX Channels (f)(p) | Linear | Register File | None | Yes, But Non-modular | 15.4 kGE
Morales et al. [39] | IoT | Tech.-Indep. | AHB (f), Perif. Intf. (f) | Linear | Register File | None | Yes, But Non-modular | 3.2 kGE
Su et al. [40] | General-purpose | Tech.-Indep. (n) | AXI4 (n) | Linear | Register File | None | No | N.A.
VDMA [41], Nandan et al. | Video | Custom | N.A. | 2D, Arb. Strides | Integrated | None | No | N.A.
Comisky et al. [42] | MCU | N.A. | TR Bus (n) | Linear | Register File | None | No | N.A.
This Work, Architecture | Extreme-edge ULP, Datacenter HPC, Application-grade | Tech.-Indep. | AXI4, AXI4 Lite, AXI4-Stream, TL-UL, TL-UH, OBI, Wishbone | Optional ND, Arb. Strides, Scatter-Gather | Register File, Transfer Descriptors, RISC-V ISA Ext., Custom | Memory Init., In-stream Accelerator | Configurable and Modular | ≥ 2 kGE
This Work, Manticore-0432x2 | HPC, FP-Workloads | ASIC (q) | AXI4, OBI | 2D, Arb. Strides, Scatter-Gather | RISC-V ISA Ext. | Memory Init. | Configurable and Modular | ≈ 75 kGE
This Work, MemPool | Image Processing | ASIC (q) | AXI4, OBI (r) | Linear | Register File | None | Configurable and Modular | ≈ 45 kGE
This Work, PULP-open | ULP ML | ASIC (q), FPGA | AXI4, OBI | 3D, Arb. Strides, Scatter-Gather | Register File | Block Transp. | Configurable and Modular | ≈ 50 kGE
This Work, Cheshire | Application-Grade | ASIC (q) | AXI4 | Linear | Transfer Descriptors | None | Configurable and Modular | ≈ 60 kGE
This Work, ControlPULP | Power Management MCU | Tech.-Indep. | AXI4, OBI | 3D, Arb. Strides, Scatter-Gather | Register File | None | Configurable and Modular | ≈ 61 kGE
This Work, IO-DMA | ULP MCU | Tech.-Indep. | OBI | Linear | Register File | None | Configurable and Modular | ≈ 2 kGE

Footnotes: (a) read-only; (b) write-only; (c) FreePDK45; (d) 28 nm node; (e) assuming 0.5 µm2 per 1 GE; (f) cross-protocol operation only; (g) pre-1.0 version; (h) assuming 0.7 µm2 per 1 GE; (i) one read-only and one write-only protocol selectable; (j) 32 b, AXI4 read, AXI4-Stream write; (k) one manager port; (l) main interface optional; (m) UltraScale_mm2s_64DW_1_100 (xcku040, ffva1156, 1); (n) to the best of our knowledge; (o) wrapper for APB, AHB, AXI Lite available; (p) very similar to OBI; (q) main target; (r) latency-tolerant version.


Our tensor_ND mid-end can execute arbitrarily high-dimensional transfers in hardware. Our desc_64 front-end supports descriptor chaining to handle arbitrarily shaped transfers without putting any load on the PE. Our flexible architecture can easily accelerate special transfer patterns required by a specific application: once programmed, our novel rt_3D mid-end autonomously fetches strided sensor data without involving any PE. Rhu et al. [32] and Ma et al. [1] present DMAEs able to modify the data while it is copied. Although their work proposes a solution for their respective application space, none of these engines presents a standardized interface to exchange stream acceleration modules between platforms easily. Additionally, both engines, especially MT-DMA, impose substantial area overhead, limiting their applicability in ULP designs. Our engines feature a well-defined interface accessing the byte stream while data is copied, allowing us to include existing accelerators. Furthermore, we provide a novel, ultra-lightweight memory initialization feature, typically requiring less than 100 GE.
FastVDMA [33] shows basic modularity by allowing users to select one read and one write protocol from a list of three on-chip protocols. Its modularity is thus limited to the back-end, and there is neither any facility to change between different programming interfaces nor a way of easily adding more complex multi-dimensional affine stream movement support. As presented in this work, our approach tackles these limitations by specifying and implementing the first fully modular, parametric, universal DMAE architecture.

VI. CONCLUSION

We present iDMA, a modular, highly parametric DMAE architecture composed of three parts (front-end, mid-end, and back-end), allowing our engines to be customized to suit a wide range of systems, platforms, and applications. We showcase its adaptability and real-life benefits by integrating it into five systems targeting various applications. In a minimal Linux-capable SoC, our architecture increases bus utilization by up to 6× while reducing the DMAE resource footprint by more than 10 %. In a ULP edge-node system, we improve MobileNetV1 inference performance from 7.9 MAC/cycle to 8.3 MAC/cycle while reducing the compute cluster area by 10 % compared to the baseline MCHAN [11]. We demonstrate iDMA's applicability in real-time systems by introducing a small real-time mid-end requiring only 11 kGE while completely liberating the core from any periodic sensor polling.
We demonstrate speedups of up to 8.4× and 15.8× in high-performance manycore systems designed for floating-point and integer workloads, respectively, compared to their baselines without DMAEs. We evaluate the area, timing, latency, and performance of iDMA, resulting in area and timing models that allow us to estimate the synthesized area and timing characteristics of any parameterization within 9 % of the actual result.
Our architecture enables the creation of both ultra-small iDMAEs incurring less than 2 kGE and large high-performance iDMAEs running at over 1 GHz on a 12 nm node. Our back-ends incur only two cycles of latency from accepting an ND transfer descriptor to the first read request being issued on the engine's protocol port. They show high agility, even in ultra-deep memory systems. Flexibility and parameterization allow us to create configurations that achieve asymptotically full bus utilization and can fully hide latency in arbitrarily deep memory systems while incurring less than 400 GE per trackable outstanding transfer. In a 32 b system, our iDMAEs achieve almost perfect bus utilization for 16 B-long transfers when accessing an endpoint with 100 cycles of latency. The synthesizable RTL description of iDMA is available free and open-source.

ACKNOWLEDGMENTS

The authors thank Fabian Schuiki, Florian Zaruba, Kevin Schaerer, Axel Vanoni, Tobias Senti, and Michele Raeber for their valuable contributions to the research project.

REFERENCES

[1] S. Ma, Y. Lei, L. Huang, and Z. Wang, "MT-DMA: A DMA controller supporting efficient matrix transposition for digital signal processing," IEEE Access, vol. 7, pp. 5808–5818, 2019.
[2] J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky, "NVIDIA A100 tensor core GPU: Performance and innovation," IEEE Micro, vol. 41, no. 2, pp. 29–35, Mar.–Apr. 2021.
[3] S. Lee, S. Hwang, M. J. Kim, J. Choi, and J. H. Ahn, "Future scaling of memory hierarchy for tensor cores and eliminating redundant shared memory traffic using inter-warp multicasting," IEEE Trans. Comput., vol. 71, no. 12, pp. 3115–3126, 2022.
[4] D. Blythe, "XeHPC Ponte Vecchio," in Proc. IEEE Hot Chips 33 Symp. (HCS), Palo Alto, CA, USA: IEEE Computer Society, 2021, pp. 1–34.
[5] X. Wang, A. Tumeo, J. D. Leidel, J. Li, and Y. Chen, "HAM: Hotspot-aware manager for improving communications with 3D-stacked memory," IEEE Trans. Comput., vol. 70, no. 6, pp. 833–848, Jun. 2021.
[6] R. Branco and B. Lee, "Cache-related hardware capabilities and their impact on information security," ACM Comput. Surv., vol. 55, no. 6, pp. 1–35, Article 125, Jun. 2023.
[7] V. Nagarajan, D. J. Sorin, M. D. Hill, and D. A. Wood, A Primer on Memory Consistency and Cache Coherence (Synthesis Lectures on Computer Architecture), vol. 15. San Rafael, CA, USA: Morgan & Claypool Publishers, 2020, pp. 1–294.
[8] A. Jain and C. Lin, Cache Replacement Policies (Synthesis Lectures on Computer Architecture), vol. 14. San Rafael, CA, USA: Morgan & Claypool Publishers, 2019, pp. 1–87.
[9] R. Balasubramonian, N. P. Jouppi, and N. Muralimanohar, Multi-Core Cache Hierarchies (Synthesis Lectures on Computer Architecture), vol. 6. 2011, pp. 1–153. [Online]. Available: https://api.semanticscholar.org/CorpusID:42059734
[10] A. Ottaviano, T. Benz, P. Scheffler, and L. Benini, "Cheshire: A lightweight, Linux-capable RISC-V host platform for domain-specific accelerator plug-in," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 70, no. 10, pp. 3777–3781, Oct. 2023.
[11] D. Rossi, I. Loi, G. Haugou, and L. Benini, "Ultra-low-latency lightweight DMA for tightly coupled multi-core clusters," in Proc. 11th ACM Conf. Comput. Frontiers, 2014, pp. 1–10.
[12] A. Ottaviano et al., "ControlPULP: A RISC-V on-chip parallel power controller for many-core HPC processors with FPGA-based hardware-in-the-loop power and thermal emulation," 2023, arXiv:2306.09501.
[13] S. Riedel, M. Cavalcante, R. Andri, and L. Benini, "MemPool: A scalable manycore architecture with a low-latency shared L1 memory," IEEE Trans. Comput., early access, 2023.
[14] F. Zaruba, F. Schuiki, and L. Benini, "A 4096-core RISC-V chiplet architecture for ultra-efficient floating-point computing," in Proc. IEEE Hot Chips 32 Symp. (HCS), Palo Alto, CA, USA, 2020, pp. 1–24.
[15] "AMBA AXI and ACE protocol specification: AXI3, AXI4, and AXI4-lite ACE and ACE-lite." Arm Developer. Accessed: Sep. 6, 2022. [Online]. Available: https://developer.arm.com/documentation/ihi0022/hc/?lang=en

[16] “OBI 1,” Silicon Labs, Inc., version 1.5.0, 2020. [Online]. Avail- [39] H. Morales, C. Duran, and E. Roa, “A low-area direct memory access
able: https://github.com/openhwgroup/programs/blob/master/TGs/cores- controller architecture for a RISC-V based low-power microcontroller,”
task-group/obi/OBI-v1.5.0.pdf in Proc. IEEE 10th Latin Amer. Symp. Circuits Syst. (LASCAS), 2019,
[17] “DesignWare IP solutions for AMBA - AXI DMA controller.” Synopsys. pp. 97–100.
Accessed: Sep. 6, 2022. [Online]. Available: https://www.synopsys.com/ [40] W. Su, L. Wang, M. Su, and S. Liu, “A processor-DMA-based memory
dw/ipdir.php?ds=amba_axi_dma copy hardware accelerator,” in Proc. IEEE 6th Int. Conf. Netw. Archit.
[18] F. Zaruba, F. Schuiki, T. Hoefler, and L. Benini, “Snitch: A tiny pseudo Storage, Jul. 2011, pp. 225–229.
dual-issue processor for area and energy efficient execution of floating- [41] N. Nandan, “High performance DMA controller for ultra HDTV video
point intensive workloads,” IEEE Trans. Comput., vol. 70, no. 11, codecs,” in Proc. IEEE Int. Conf. Consum. Electron. (ICCE), Jan. 2014,
pp. 1845–1860, Nov. 2021. pp. 65–66.
[19] “AXI-Stream protocol specification.” Arm Developer. Accessed: Sep. 6, [42] D. Comisky, S. Agarwala, and C. Fuoco, “A scalable high-performance
2022. [Online]. Available: https://developer.arm.com/documentation/ DMA architecture for DSP applications,” in Proc. Int. Conf. Comput.
ihi0051/b/?lang=en Des., Sep. 2000, pp. 414–419.
[20] “SiFive TileLink specification.” SiFive. Accessed: Sep. 6, 2022. [On-
line]. Available: https://starfivetech.com/uploads/tilelink_spec_1.8.1.pdf Thomas Benz (Graduate Student Member, IEEE)
[21] A. Pullini, D. Rossi, I. Loi, G. Tagliavini, and L. Benini, “Mr.Wolf: received the B.Sc. and M.Sc. degrees in electrical
An energy-precision scalable parallel ultra low power SoC for IoT edge engineering and information technology from ETH
processing,” IEEE J. Solid-State Circuits, vol. 54, no. 7, pp. 1970–1981, Zurich, in 2018 and 2020, respectively. He is cur-
Jul. 2019. rently working toward the Ph.D. degree with the
[22] “HyperBusT M specification,” Cypress Semiconductor, 2019. [Online]. Digital Circuits and Systems group of Prof. Benini.
Available: https://www.infineon.com/dgdl/Infineon-HYPERBUS_ His research interests include energy-efficient high-
SPECIFICATION_LOW_SIGNAL_COUNT_HIGH_PERFORMANCE_ performance computer architectures and the design
DDR_BUS-AdditionalTechnicalInformation-v09_00-EN.pdf?fileId=8ac of ASICs.
78c8c7d0d8da4017d0ed619b05663
[23] A. Burrello, A. Garofalo, N. Bruschi, G. Tagliavini, D. Rossi, and
F. Conti, “DORY: Automatic end-to-end deployment of real-world
DNNs on low-cost IoT MCUs,” IEEE Trans. Comput., vol. 70, no. 8,
Michael Rogenmoser (Graduate Student Member,
pp. 1253–1268, Aug. 2021.
IEEE) received the B.Sc. and M.Sc. degrees in
[24] T. Rosedahl, M. Broyles, C. Lefurgy, B. Christensen, and W.
electrical engineering and information technology
Feng, “Power/performance controlling techniques in OpenPOWER,”
from ETH Zurich, in 2020 and 2021, respectively.
in Proc. High Perform. Comput., Cham, Switzerland: Springer, 2017,
He is working toward the Ph.D. degree with the
pp. 275–289.
Digital Circuits and Systems group of Prof. Benini.
[25] I. Ripoll and R. Ballester-Ripoll, “Period selection for minimal hyper-
His research interests include fault-tolerant process-
period in periodic task systems,” IEEE Trans. Comput., vol. 62, no. 9,
ing architectures and multicore heterogeneous SoCs
pp. 1813–1822, Sep. 2013.
for space.
[26] “AXI DMA v7.1 LogiCORE IP product guide,” AMD Xilinx, 2022.
[Online]. Available: https://docs.xilinx.com/r/en-US/pg021_axi_dma
[27] T. A. Davis and Y. Hu, “The University of Florida sparse matrix
collection,” ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1–25, Article
1, Nov. 2011. Paul Scheffler (Graduate Student Member, IEEE)
[28] “256Mb high bandwidth RPC DRAM,” Etron Technology, Inc., revision received the B.Sc. and M.Sc. degrees in electri-
1.0, 2019. [Online]. Available: https://etronamerica.com/wp-content/ cal engineering and information technology from
uploads/2019/05/EM6GA16LGDABMACAEA-RPC-DRAM_Rev.-1.0. ETH Zurich, in 2018 and 2020, respectively. He
pdf is currently working toward the Ph.D. degree with
[29] “Reduced pin count (RPC®) DRAM,” Etron Technology, Inc., the Digital Circuits and Systems group of Prof.
v20052605.0, 2022. [Online]. Available: https://etron.com/wp-content/ Benini. His research interests include accelera-
uploads/2022/06/RPC-DRAM-Overview-Flyer-v20202605.pdf tion of sparse and irregular workloads, on-chip
[30] J. Fjeldtvedt and M. Orlandić, “CubeDMA – Optimizing three- interconnects, manycore architectures, and high-
dimensional DMA transfers for hyperspectral imaging applications,” performance computing.
Microprocessors Microsystems, vol. 65, pp. 23–36, Mar. 2019.
[31] K. Paraskevas et al., “Virtualized multi-channel RDMA with software-
defined scheduling,” Procedia Comput. Sci., vol. 136, pp. 82–90, 2018. Samuel Riedel (Member, IEEE) received the B.Sc.
[32] M. Rhu, M. O’Connor, N. Chatterjee, J. Pool, Y. Kwon, and and M.Sc. degrees in electrical engineering and
S. W. Keckler, “Compressing DMA engine: Leveraging activation spar- information technology from ETH Zurich, in 2017
sity for training deep neural networks,” in Proc. IEEE Int. Symp. High and 2019, respectively. He is currently working
Perform. Comput. Archit. (HPCA), 2018, pp. 78–91. toward the Ph.D. degree with the Digital Circuits
[33] “Antmicro releases FastVDMA open-source resource-light DMA con- and Systems group of Prof. Benini. His research
troller.” AB Open. Accessed: Feb. 7, 2023. [Online]. Available: https:// interests include computer architecture, focusing on
abopen.com/news/antmicro-releases-fastvdma-open-source-resource- manycore systems and their programming model.
light-dma-controller/
[34] “CoreLink™DMA-330 DMA controller.” Arm Developer. Accessed:
Dec. 18, 2022. [Online]. Available: https://developer.arm.com/
documentation/ddi0424/d/?lang=en
[35] “DMA AXI IP controller.” Rambus. Accessed: Dec. 18, 2022. [Online].
Available: https://www.plda.com/products/vdma-axi Alessandro Ottaviano (Graduate Student Member,
[36] “Dmaengine overview – STM32MPU.” STMicroelectronics. Accessed: IEEE) received the B.Sc. degree in physical engi-
Dec. 18, 2022. [Online]. Available: https://wiki.st.com/stm32mpu/wiki/ neering from Politecnico di Torino, Italy, and the
Dmaengine_overview M.Sc. degree in electrical engineering as a joint de-
[37] “DDMA, multi-channel DMA controller IP core from DCD-SEMI.” gree between Politecnico di Torino, Grenoble INP -
Design and Reuse. Accessed: Dec. 18, 2022. [Online]. Available: https:// Phelma, and EPFL Lausanne, in 2018 and 2020,
www.design-reuse.com/news/53210/multi-channel-dma-controller-ip- respectively. He is currently working toward the
core-dcd-semi.html Ph.D. degree with the Digital Circuits and Systems
[38] A. Pullini, D. Rossi, G. Haugou, and L. Benini, “uDMA: An autonomous group of Prof. Benini. His research interests include
I/O subsystem for IoT end-nodes,” in Proc. 27th Int. Symp. Power Timing real-time and predictable computing systems and
Model. Optim. Simul. (PATMOS), 2017, pp. 1–8. energy-efficient processor architecture.


Andreas Kurth (Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical engineering and information technology from ETH Zurich, in 2014 and 2017, respectively, and the Ph.D. degree with the Digital Circuits and Systems group of Prof. Benini, in 2022. His research interests include the architecture and programming of heterogeneous SoCs and accelerator-rich computing systems.

Torsten Hoefler (Fellow, IEEE) received the Ph.D. degree from Indiana University. He is a Professor in computer science with ETH Zürich, Switzerland. His research interests include performance-centric software and hardware development. He is a fellow of the ACM and a member of the Academia Europaea.

Luca Benini (Fellow, IEEE) holds the Chair of Digital Circuits and Systems with ETH Zurich and is a Full Professor with the Università di Bologna. His research interests include energy-efficient computing systems design, from embedded to high-performance. He has published more than 1000 peer-reviewed papers and five books. He is a fellow of the ACM and a member of Academia Europaea.
