A High-Performance, Energy-Efficient Modular DMA Engine Architecture

Abstract—Data transfers are essential in today's computing systems as latency and complex memory access patterns are increasingly challenging to manage. Direct memory access engines (DMAEs) are critically needed to transfer data independently of the processing elements, hiding latency and achieving high throughput even for complex access patterns to high-latency memory. With the prevalence of heterogeneous systems, DMAEs must operate efficiently in increasingly diverse environments. This work proposes a modular and highly configurable open-source DMAE architecture called intelligent DMA (iDMA), split into three parts that can be composed and customized independently. The front-end implements the control plane binding to the surrounding system. The mid-end accelerates complex data transfer patterns such as multi-dimensional transfers, scattering, or gathering. The back-end interfaces with the on-chip communication fabric (data plane). We assess the efficiency of iDMA in various instantiations: In high-performance systems, we achieve speedups of up to 15.8× with only 1 % additional area compared to a base system without a DMAE. We achieve an area reduction of 10 % while improving ML inference performance by 23 % in ultra-low-energy edge AI systems over an existing DMAE solution. We provide area, timing, latency, and performance characterization to guide its instantiation in various systems.

Index Terms—DMA, DMAC, direct memory access, memory systems, high-performance, energy-efficiency, edge AI, AXI, TileLink.

Manuscript received 6 April 2023; revised 4 October 2023; accepted 29 October 2023. Date of publication 7 November 2023; date of current version 22 December 2023. This work was supported in part by the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement 800928 and Specific Grant Agreements 101036168 (EPI SGA2) and 101034126 (The EU Pilot). Recommended for acceptance by T. Adegbija. (Corresponding author: Thomas Benz.)
Thomas Benz, Michael Rogenmoser, Paul Scheffler, Samuel Riedel, Alessandro Ottaviano, and Andreas Kurth are with the Integrated Systems Laboratory (IIS), ETH Zurich, 8092 Zürich, Switzerland (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).
Torsten Hoefler is with the Scalable Parallel Computing Laboratory (SPCL), ETH Zurich, 8092 Zürich, Switzerland (e-mail: [email protected]).
Luca Benini is with the Integrated Systems Laboratory (IIS), ETH Zurich, Switzerland, and also with the Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, 40126 Bologna, Italy (e-mail: [email protected]).
Digital Object Identifier 10.1109/TC.2023.3329930

I. INTRODUCTION

Direct memory access engines (DMAEs) form the communication backbone of many contemporary computers [1]. They concurrently move data at high throughput while hiding memory latency and minimizing processor load, freeing the latter to do useful compute. This function becomes increasingly critical with the trend towards physically larger systems [2] and ever-increasing memory bandwidths [3]. With Moore's Law slowing down, 2.5D and 3D integration are required to satisfy future applications' computational and memory needs, leading to wider and higher-bandwidth memory systems and longer access latencies [4], [5].

Without DMAEs, processing elements (PEs) need to read and write data from and to remote memory, often relying on deep cache hierarchies to mitigate performance and energy overheads. This paper focuses on explicitly managed memory hierarchies, where copies across the hierarchy are handled by DMAEs. We refer the interested reader to [6], [7], [8], [9] for excellent surveys on cache-based memory systems. Caches and DMAEs often coexist in modern computing systems as they address different application needs. Dedicated DMAEs are introduced to efficiently and autonomously move data for workloads where memory access is predictable, weakly data-dependent, and made in fairly large chunks, decoupling memory accesses from execution and helping maximize PE time spent on useful compute.

When integrating DMAEs, three main design challenges must be tackled: the control-plane interface to the PEs, the intrinsic data movement capabilities of the engine, and the on-chip protocols supported in the data plane. The sheer number of DMAEs present in literature and available as commercial products explains why these choices are usually fixed at design time. The increased heterogeneity in today's accelerator-rich computing environments leads to even more diverse requirements for DMAEs. Different on-chip protocols, programming models, and application profiles lead to a large variety of different direct memory access (DMA) units used in modern systems on chip (SoCs), hindering integration and verification efforts.

We present a modular and highly parametric DMAE architecture called intelligent DMA (iDMA), which is composed of three distinct parts: the front-end handling PE interaction, the mid-end managing the engine's lower-level data movement capabilities, and the back-end implementing one or more on-chip protocol interfaces. We call concrete implementations of our iDMA architecture iDMAEs. All module boundaries are standardized to facilitate the substitution of individual parts, allowing for the same DMAE architecture to be used across a
Fig. 2. Outline of the 1D transfer descriptor (exchanged between mid- and back-end).

TABLE I
IDENTIFIERS AND DESCRIPTIONS OF FRONT-ENDS EMPLOYED IN THE USE CASES. FRONT-ENDS IN GRAY ARE AVAILABLE BUT NOT FURTHER DISCUSSED IN THIS WORK

Front-end (Conf.): Description
reg_32 (32-b, 1D), reg_32_2d (32-b, 2D), reg_32_3d (32-b, 3D), reg_64 (64-b, 1D), reg_64_2d (64-b, 2D): Core-private register-based configuration interface for ULP systems.
reg_32_rt_3d (32-b, 3D): Core-private register-based system binding supporting our real-time mid-end.
desc_64 (64-b, 1D): Transfer-descriptor-based interface designed for 64-b systems, compatible with the Linux DMA interface.
inst_64 (64-b, 2D): Interface decoding custom iDMA RISC-V instructions, used in HPC systems.

A. Front-End

We present three front-end types: a simple and area-efficient register-based scheme, an efficient microcode programming interface, and a high-performance descriptor-based front-end, as shown in Table I. Our selection is tailored to the current set of use cases; different front-ends can easily be created, e.g., allowing us to use our descriptor-based binding in 32-b systems.

Register-based: Core-private register-based configuration interfaces are the simplest front-ends. Each PE uses its own dedicated configuration space to eliminate race conditions while programming the DMAE [11]. We employ different memory-mapped register layouts depending on the host system's word width and whether a multi-dimensional tensor mid-end is present. The src_address, dst_address, transfer_length, status, configuration, and transfer_id registers are shared between all variants. In the case of a multi-dimensional configuration, every tensor dimension introduces three additional fields: src_stride, dst_stride, and num_repetitions. After configuring the shape of a transfer, it is launched by reading from transfer_id, which returns an incrementing unique transfer ID. The ID of the last completed transfer may be read from the status register, enabling transfer-level synchronization.
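A minimal driver sketch of this programming sequence from a PE's perspective is shown below. Only the register names follow the layout described above; the base address and register offsets are illustrative assumptions, not the actual iDMA register map.

#include <stdint.h>

/* Hypothetical MMIO base and offsets of one core-private reg_32 front-end.
 * Offsets are illustrative; only the register names follow the text above. */
#define IDMA_BASE        0x1000A000u
#define IDMA_SRC_ADDR    0x00u  /* src_address                              */
#define IDMA_DST_ADDR    0x04u  /* dst_address                              */
#define IDMA_LENGTH      0x08u  /* transfer_length                          */
#define IDMA_STATUS      0x0Cu  /* status: ID of the last completed transfer */
#define IDMA_CONF        0x10u  /* configuration                            */
#define IDMA_TRANSFER_ID 0x14u  /* transfer_id: reading it launches the DMA */

static inline void mmio_write(uint32_t off, uint32_t val) {
    *(volatile uint32_t *)(IDMA_BASE + off) = val;
}
static inline uint32_t mmio_read(uint32_t off) {
    return *(volatile uint32_t *)(IDMA_BASE + off);
}

/* Configure and launch a 1D copy; returns the unique transfer ID. */
uint32_t idma_memcpy_1d(uint32_t dst, uint32_t src, uint32_t len) {
    mmio_write(IDMA_SRC_ADDR, src);
    mmio_write(IDMA_DST_ADDR, dst);
    mmio_write(IDMA_LENGTH, len);
    /* Reading transfer_id commits the configured shape and starts the transfer. */
    return mmio_read(IDMA_TRANSFER_ID);
}

/* Transfer-level synchronization: spin until the given ID has completed. */
void idma_wait(uint32_t id) {
    while ((int32_t)(mmio_read(IDMA_STATUS) - id) < 0) { /* spin */ }
}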
Descriptor-based: As the use of transfer descriptors is common practice in Linux-capable multicore systems [1], [17], we provide desc_64, a 64-b front-end compatible with the Linux DMA interface. Given a pointer, the front-end uses a dedicated manager port to fetch transfer descriptors from memory; currently, the AXI, AXI-Lite [15], and OBI [16] protocols are supported. The descriptors consist of a src_address, dst_address, transfer_length, and a run-time backend_configuration, corresponding to the information required for a 1D transfer. Descriptor chaining [1] is supported to allow efficient long and arbitrarily shaped transfers.
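As a sketch, one plausible in-memory layout of such a descriptor and of a two-element chain is given below. The field names follow the enumeration above; the field widths, their ordering, the next-pointer convention, and the launch mechanism are assumptions for illustration only.

#include <stdint.h>

/* Sketch of a 64-b iDMA transfer descriptor as fetched by desc_64.
 * Field order, widths, and the chain-termination convention are
 * illustrative assumptions; only the field names follow the text above. */
typedef struct idma_desc {
    uint64_t next;            /* pointer to the next descriptor (chaining)  */
    uint64_t backend_config;  /* run-time back-end configuration            */
    uint64_t src_address;     /* source base address                        */
    uint64_t dst_address;     /* destination base address                   */
    uint64_t transfer_length; /* length of the 1D transfer in bytes         */
} idma_desc_t;

/* Build a two-element chain copying two disjoint buffers back to back. */
void build_chain(idma_desc_t d[2],
                 uint64_t dst0, uint64_t src0, uint64_t len0,
                 uint64_t dst1, uint64_t src1, uint64_t len1) {
    d[0] = (idma_desc_t){ .next = (uint64_t)(uintptr_t)&d[1],
                          .backend_config = 0,
                          .src_address = src0, .dst_address = dst0,
                          .transfer_length = len0 };
    d[1] = (idma_desc_t){ .next = ~0ull,  /* assumed end-of-chain marker    */
                          .backend_config = 0,
                          .src_address = src1, .dst_address = dst1,
                          .transfer_length = len1 };
    /* The chain is launched by handing the address of d[0] to the front-end,
     * e.g., with a single write to a doorbell register (system-specific). */
}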
Instruction-based: We present inst_64, a front-end that can be tightly coupled to a RISC-V core, encoding iDMA transfers directly as instructions. For example, a Snitch [18] RISC-V core using inst_64 can launch a transaction within three cycles, enabling highly agile data transfers.

B. Mid-End

In iDMA, mid-ends process complex transfers coming from the front-end and decompose them into one or multiple 1D transfers, which can be handled directly by the back-end. An overview of our mid-ends can be found in Table II.

TABLE II
IDENTIFIERS OF IMPLEMENTED MID-ENDS

Mid-end: Description
tensor_2D: Optimized to accelerate 2D transfers.
tensor_ND: Mid-end accelerating ND transfers.
mp_split: Mid-end splitting transfers along a parametric address boundary.
mp_dist: Mid-end distributing transfers over multiple back-ends.
rt_3D: Mid-end repetitively launching 3D transfers, designed for real-time systems.

Tensor Mid-ends: To support multi-dimensional transfers, two distinct mid-ends are provided. Tensor_2D supports 2D transfers through an interface common in embedded systems, which requires the source and destination strides, the total length of the transfer, the base address, and the length of the 1D transfers it is composed of. Tensor_ND can be parameterized at compile time to support transfers on tensors of any dimension N. It is programmed by providing the source address, destination address, number of repetitions, and strides for each dimension.
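The decomposition an ND transfer undergoes can be pictured as the nested loop below. This is a software reference of the address generation, not the hardware implementation; the argument layout and the dimension limit are illustrative assumptions.

#include <stdint.h>
#include <stddef.h>

/* Software reference of how an ND transfer decomposes into 1D transfers.
 * Dimension d (d = 0 .. n_dims-1) is described by reps[d] and its strides;
 * each innermost iteration is what the back-end receives as a contiguous
 * 1D copy of `len` bytes. Purely illustrative. */
void idma_nd_reference(uint8_t *dst, const uint8_t *src, size_t len,
                       unsigned n_dims, const size_t reps[],
                       const ptrdiff_t src_stride[],
                       const ptrdiff_t dst_stride[]) {
    size_t idx[8] = {0};  /* assume at most 8 dimensions for this sketch */
    for (;;) {
        /* innermost 1D transfer handed to the back-end */
        for (size_t i = 0; i < len; ++i)
            dst[i] = src[i];
        /* advance the multi-dimensional index like an odometer */
        unsigned d = 0;
        while (d < n_dims) {
            src += src_stride[d];
            dst += dst_stride[d];
            if (++idx[d] < reps[d])
                break;
            /* roll this dimension back and carry into the next one */
            src -= (ptrdiff_t)reps[d] * src_stride[d];
            dst -= (ptrdiff_t)reps[d] * dst_stride[d];
            idx[d] = 0;
            ++d;
        }
        if (d == n_dims)
            return;  /* all repetitions of all dimensions issued */
    }
}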
Distribution Mid-ends: In large manycore systems like MemPool [13], one centralized actor may want to schedule the data requests of multiple interconnect manager ports. We implement this functionality with a distributed multi-back-end iDMAE. To distribute work among back-ends, we create two specialized mid-ends called mp_split and mp_dist. Mp_split splits a single linear transfer into multiple transfers aligned to a parametric address boundary, guaranteeing that no resulting transfer crosses specific address boundaries; this is required when sending distributed transfers to multiple back-ends, see Section III-D. Mp_dist then distributes the split transfers over multiple parallel downstream mid- or back-ends, arbitrating the transfers based on their address offsets. The number of mp_dist's outgoing ports defaults to two.

Real-time Mid-end: Repeated ND transfers are often required for data acquisition tasks such as reading out sensor arrays in real-time systems featuring complex address maps. The number of dimensions, N, can be set at compile time. To relieve the pressure on general-purpose cores already involved in task scheduling and computation, this can be done by iDMA using a specialized mid-end. The rt_3D mid-end enables a configurable number of repeated 3D transactions whose periodicity and transfer shape are configured via the front-end. A bypass
TABLE III
AVAILABLE ON-CHIP PROTOCOLS AND THEIR KEY CHARACTERISTICS. ALL PROTOCOLS SHARE BYTE ADDRESSABILITY AND A READY-VALID HANDSHAKE

Protocol (version): burst support
AXI4+ATOP [15] (H.c): up to 256 beats or 4 kB, whichever is reached first
AXI4 Lite [15] (H.c): no bursts
AXI4 Stream [19] (B): unlimited
OpenHW OBI [16] (v1.5.0): no bursts
SiFive TileLink [20] (v1.8.1; TL-UL and TL-UH supported): power-of-two bursts
Init (memory initialization pseudo-protocol, read-only): N.A.

buffered operation may be required depending on the system and the memory endpoints; in this case, the small FIFO buffer in the dataflow element may be replaced with an SRAM-based buffer, allowing entire transfers to be stored. Two shifters, one at each end of the dataflow element, align the byte stream to bus boundaries. In-stream accelerators, allowing operations to be performed on the data stream during data movement, may be integrated into the dataflow element, augmenting the buffer in the transport layer. Our dataflow-oriented architecture allows us to switch between multiple read managers, write managers, and in-stream accelerators in-cycle, allowing our engine to asymptotically reach perfect bus utilization even when the used protocols or acceleration schemes change regularly.

Protocol Managers: The transport layer abstracts the interfacing on-chip protocols through read and write managers with standardized interfaces, allowing for true multi-protocol capabilities. Read managers receive the current read transfer's base address, transfer length, and protocol-specific configuration information as inputs. They then emit a read-aligned stream of data bytes to the downstream transport layer. Write managers receive the write transfer information and the write-aligned stream of data bytes from the upstream transport layer to be emitted over their on-chip protocol's manager port. Table III provides a complete list of supported protocols. The Init pseudo-protocol only provides a read manager emitting a configurable stream of either the same repeated value, incrementing values, or a pseudorandom sequence. This enables our engine to accelerate memory initialization.

III. CASE STUDIES

To demonstrate the generality and real-world benefits of iDMA, we detail its integration into five systems spanning a wide range of capabilities, from ULP processors for edge AI to HPC manycore architectures.

A. PULP-Open

PULP-open is a ULP edge compute platform consisting of a 32-b RISC-V microcontroller host and a parallel compute cluster [21]. The compute cluster comprises eight 32-b RISC-V cores with custom instruction set architecture (ISA) extensions to accelerate digital signal processing (DSP) and ML workloads, enabling energy-efficient ML inference in extreme-edge AI nodes. These cores are connected to an SRAM-based tightly-coupled data memory (TCDM) with single-cycle access latency, providing the processing cores with fast access to shared data. While the TCDM is fast, it is very limited in size; the platform thus features a level-two (L2) on-chip and a level-three (L3) off-chip HyperBus RAM [22]. To allow the cluster fast access to these larger memories, a DMA unit specialized for transferring data from and to the level-one (L1) memory is embedded.

Fig. 6. (Top) Block diagram of the PULP-open system. (Bottom) Configuration of the cluster iDMAE.

iDMAE Integration: In the PULP-open system, our iDMAE is integrated into the processing cluster with a 64-b AXI4 interface to the host platform and an OBI connection to the TCDM, see Fig. 6. The multi-protocol back-end is fed by a tensor_ND mid-end, configured to support three dimensions, allowing for fast transfer of the 3D data structures common in ML workloads. At the same time, higher-dimensional transfers are handled in software. The back- and mid-end are configured through per-core reg_32_3d front-ends and two additional front-ends allowing the host processor to configure the iDMAE. The front-ends are connected to the tensor_ND mid-end through a round-robin arbitration mid-end. Multiple per-core front-ends ensure atomic DMA access and prevent interference between the cores launching transactions.

Benchmarks: To evaluate iDMAE performance in a realistic application, we use Dory [23] to implement MobileNetV1 inference on PULP-open. This workload relies heavily on the iDMAE to transfer the data for each layer, stored in L2 or off-chip in L3, into the cluster's TCDM in parallel with cluster core computation. 2D, 3D, and very small transfers are frequently required for this workload.
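The sketch below outlines the double-buffering pattern this implies: while one tile is being computed on, the next one is already streaming into the TCDM. The helpers idma_memcpy_2d(), idma_wait(), and compute_tile() are hypothetical and stand in for the per-core reg_32_3d front-end binding and the application kernel; they are not the actual PULP-open runtime API.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers in the spirit of the per-core reg_32_3d front-end. */
extern uint32_t idma_memcpy_2d(void *dst, const void *src, size_t line_len,
                               size_t num_lines, ptrdiff_t dst_stride,
                               ptrdiff_t src_stride);
extern void idma_wait(uint32_t transfer_id);
extern void compute_tile(void *tcdm_tile, size_t bytes);

/* Double-buffered processing of an L2-resident layer: tile i is computed on
 * while tile i+1 is being transferred into the other TCDM buffer. */
void process_layer(const uint8_t *l2_src, size_t n_tiles, size_t line_len,
                   size_t lines_per_tile, ptrdiff_t l2_stride,
                   uint8_t tcdm_buf[2][8 * 1024]) {
    size_t tile_bytes = line_len * lines_per_tile;
    uint32_t id = idma_memcpy_2d(tcdm_buf[0], l2_src, line_len, lines_per_tile,
                                 (ptrdiff_t)line_len, l2_stride);
    for (size_t i = 0; i < n_tiles; ++i) {
        idma_wait(id);                              /* tile i has arrived     */
        if (i + 1 < n_tiles)                        /* prefetch the next tile */
            id = idma_memcpy_2d(tcdm_buf[(i + 1) % 2],
                                l2_src + (i + 1) * lines_per_tile * l2_stride,
                                line_len, lines_per_tile,
                                (ptrdiff_t)line_len, l2_stride);
        compute_tile(tcdm_buf[i % 2], tile_bytes);  /* overlaps with the DMA  */
    }
}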
In previous versions of PULP-open, MCHAN [11] was used to transfer data between the host L2 and the cluster's TCDM. We assume this as a baseline for our evaluation.

Results: In PULP-open, the iDMAE can almost fully utilize the bandwidth to the L2 and TCDM in both directions: measuring with the on-board timer, a transfer of 8 KiB from the cluster's TCDM to L2 requires 1107 cycles, of which 1024 cycles are needed to transfer the data over the 64-b data bus. The minimal overhead is caused by configuration, system latency, and contention with other ongoing memory accesses. During MobileNetV1 inference, individual cores frequently require short transfers, incurring a potentially high configuration overhead.
With its improved tensor_ND mid-end, iDMA improves the cores' utilization and throughput for the network over MCHAN, achieving an average of 8.3 MAC/cycle compared to the previously measured 7.9 MAC/cycle. Furthermore, configured with queue depths similar to MCHAN's, the iDMAE with its reg_32_3d front-ends achieves a 10 % reduction in the utilized area within a PULP cluster.

B. ControlPULP

ControlPULP [12] is an on-chip parallel microcontroller unit (MCU) employed as a power controller system (PCS) for manycore HPC processors. It comprises a single 32-b RISC-V manager domain with 512 KiB of L2 scratchpad memory and a programmable accelerator (cluster domain) hosting eight 32-b RISC-V cores and 128 KiB of TCDM.

A power control firmware (PCF) running on FreeRTOS implements a reactive power management policy. ControlPULP receives (i) dynamic voltage and frequency scaling (DVFS) directives such as frequency targets and power budgets from high-level controllers and (ii) temperature readings from process-voltage-temperature (PVT) sensors and power consumption from voltage regulator modules (VRMs), and is tasked with meeting these constraints. The PCF consists of two periodic tasks that handle the power management policy: a periodic frequency control task (PFCT) (low priority) and a periodic voltage control task (PVCT) (high priority).

ControlPULP requires an efficient scheme to collect sensor data at each periodic step without adding overhead to the computation part of the power management algorithm.

iDMAE Integration: As presented by Ottaviano et al. [12], the manager domain offloads the computation of the control action to the cluster domain, which independently collects the relevant data from PVT sensors and VRMs. We redesign ControlPULP's data movement paradigm by integrating a second dedicated iDMAE, called sensor DMAE (sDMAE), in the manager domain to simplify the programming model and redirect non-computational, high-latency data movement functions to the manager domain, similar to IBM's Pstate and Stop engines [24]. Our sDMAE is enhanced with rt_3D, a mid-end capable of autonomously launching repeated 3D transactions. ControlPULP's architecture is heavily inspired by PULP-open; the iDMAE integration can thus be seen in Fig. 6. The goal of the extension is to further reduce software overhead for the data movement phase, which is beneficial to the controller's slack within the control hyperperiod [25]. The sDMAE supports several interface protocols, thus allowing the same underlying hardware to handle multiple scenarios.
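A configuration sketch for such a periodic sensor readout through a reg_32_rt_3d-style register binding is given below. Every register name, offset, and the base address here is a hypothetical placeholder; only the configured quantities (addresses, shape, repetitions, and period) follow the description above.

#include <stdint.h>

/* Hypothetical register map of a reg_32_rt_3d front-end feeding the rt_3D
 * mid-end. All names and offsets are illustrative assumptions; strides per
 * dimension would be configured through analogous registers (not shown). */
#define SDMA_BASE        0x1A107000u
#define SDMA_SRC_ADDR    0x00u  /* base address of the sensor/VRM region  */
#define SDMA_DST_ADDR    0x04u  /* destination buffer in L2               */
#define SDMA_LENGTH      0x08u  /* innermost 1D length in bytes           */
#define SDMA_REPS_D1     0x0Cu  /* repetitions, dimension 1               */
#define SDMA_REPS_D2     0x10u  /* repetitions, dimension 2               */
#define SDMA_PERIOD      0x14u  /* repetition period in timer ticks       */
#define SDMA_NUM_REPS    0x18u  /* number of periodic launches (0 = inf)  */
#define SDMA_CTRL        0x1Cu  /* bit 0: arm the periodic transfer       */

static inline void sdma_wr(uint32_t off, uint32_t val) {
    *(volatile uint32_t *)(SDMA_BASE + off) = val;
}

/* Arm the sDMAE once; it then collects the telemetry every period without
 * further involvement of the manager core. */
void sdma_arm_periodic(uint32_t src, uint32_t dst, uint32_t len,
                       uint32_t reps1, uint32_t reps2,
                       uint32_t period_ticks) {
    sdma_wr(SDMA_SRC_ADDR, src);
    sdma_wr(SDMA_DST_ADDR, dst);
    sdma_wr(SDMA_LENGTH, len);
    sdma_wr(SDMA_REPS_D1, reps1);
    sdma_wr(SDMA_REPS_D2, reps2);
    sdma_wr(SDMA_PERIOD, period_ticks);
    sdma_wr(SDMA_NUM_REPS, 0);  /* repeat until disarmed */
    sdma_wr(SDMA_CTRL, 0x1);    /* arm                   */
}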
Benchmarks and results: We evaluate the performance of the enhanced sDMAE by executing the PCF on top of FreeRTOS within an FPGA-based (Xilinx Zynq UltraScale+) hardware-in-the-loop (HIL) framework that couples the programmable logic implementing the PCS with a power, thermal, and performance model of the plant running on top of the ARM-based Processing System [12].

Data movement handled by rt_3D, which allows repeated 3D transactions to be launched, brings several benefits to the application scenario under analysis. First, it decouples the main core in the processing domain from the sDMAE in the I/O domain. The sDMAE autonomously realizes periodic external data accesses in hardware, minimizing the context switching and response latency suffered by the manager core in a pure software-centric approach. We consider a PFCT period of 500 µs and a PVCT period of 50 µs, meaning at least ten task preemptions during one PFCT step under FreeRTOS' preemptive scheduling policy. The measured task context switch time in FreeRTOS for ControlPULP is about 120 clock cycles [12], while the iDMAE programming overhead for reading and applying the computed voltages is about 100 clock cycles. From FPGA profiling runs, we find that the use of the sDMAE saves about 2200 execution cycles every scheduling period, thus increasing the slack of the PVCT task. Autonomous and intelligent data access from the I/O domain is beneficial as it allows the two subsystems to reside in independent power and clock domains that can be put to sleep and woken up when needed, reducing the uncore domain's power consumption.

Our changes add minimal area overhead to the system. In the case of eight events and sixteen outstanding transactions, the sDMAE is about 11 kGE in size, accounting for an area increase of only 0.001 % over the original ControlPULP area. The overhead imposed by the sDMAE is negligible when ControlPULP is used as an on-chip power manager for large HPC processors; it has been shown [12] that the entire ControlPULP occupies only a small area of around 0.1 % on a modern HPC CPU die.

C. Cheshire

Cheshire [10] is a minimal, technology-independent, 64-b Linux-capable SoC based around CVA6 [14]. In its default configuration, Cheshire features a single CVA6 core, but coherent multicore configurations are possible.

iDMAE Integration: As Cheshire may be configured with multiple cores running different operating systems or SMP Linux, we connect our iDMAE to the SoC using desc_64. Descriptors are placed in scratchpad memory (SPM) by a core and are then, on launch, fetched and executed by our iDMAE. This single-write launch ensures atomic operation in multi-hart environments. Support for descriptor chaining enables arbitrarily shaped transfers. Furthermore, transfer descriptors allow for loose coupling between the PEs and the DMAE, enabling our engine to hide the memory endpoint's latency and freeing the PEs up to do useful work.

The used back-end is configured to a data and address width of 64 b and can track eight outstanding transactions, enough to support efficient fine-grained accesses to the SPM and external memory IPs. A schematic view of Cheshire and the iDMAE configuration can be found in Fig. 7.

Benchmarks: We use the AXI DMA v7.1 [26] from Xilinx integrated into the Cheshire SoC as a comparison. We run synthetic workloads copying data elements of varying lengths, allowing a more direct comparison of the bus utilization at a given transfer length.

Results: Compared to AXI DMA v7.1, the iDMAE increases bus utilization by almost 6× when launching fine-grained 64 B
TABLE IV
AREA DECOMPOSITION OF THE DMAE CONFIGURATION USED IN THE PULP CLUSTER, SEE SECTION III-A. THE BASE AREA IS ALWAYS REQUIRED; THE CONTRIBUTION OF EACH PROTOCOL ADDED IS SHOWN. IF THE AREA CONTRIBUTION IS NON-ZERO, THE PARAMETER INFLUENCING THE VALUE IS PROVIDED USING BIG-O NOTATION. THE AREA CONTRIBUTION SCALES LINEARLY WITH THE DATA WIDTH (DW) IF NO SCALING IS PROVIDED.
(The per-protocol numeric values of this table are not reproduced here. The listed contributors are a base contribution scaling with the number of outstanding transactions, O(NAx); the legalizer state, scaling with the address width, O(AW); the back-end page and power-of-two splitters, O(1); and the contribution of each read/write protocol manager.)
influence of parametrization on our architecture, enabling quick and accurate estimations when integrating engines into new systems. We then use these models to show that iDMA's area and timing scale well for any reasonable parameterization. Finally, we present latency results for our back-end and discuss our engine's performance in three sample memory systems. For implementation experiments, we use GlobalFoundries' GF12LP+ technology with a 13-metal stack and a 7.5-track standard cell library in the typical process corner. We synthesize our designs using Synopsys Design Compiler 2022.12 in topological mode to account for place-and-route constraints, congestion, and physical phenomena.

A. Area Model

We focus our evaluation and modeling effort on our iDMA back-end, as both the front-end and mid-end are very application- and platform-specific and can only be properly evaluated in-system. Our area model was evaluated at 1 GHz using the typical process corner in GF12LP+.

For each of the back-end's major area contributors listed in Table IV, we fit a set of linear models using non-negative least squares. For each parametrization, our models take a vector containing the number of ports of each protocol as an input. This set of models allows us to estimate the area decomposition of the base hardware of the back-end and the area contributions of any additional protocol port, given a particular parameterization, with an average error of less than 4 %. For example, Table IV shows the modeled area decomposition for our base configuration of 32-b address width, 32-b data width, and two outstanding transfers.

A second step is required to estimate area contributions to the back-end depending on the parameterization, the number, and the type of ports. We create a second parametric model estimating the influence of the three main parameters, address width (AW), data width (DW), and the number of outstanding transactions (NAx), on the back-end's area contributions. Given both the parameterization and the used read/write protocol ports as input, we can estimate the area composition of the back-end with an average error of less than 9 %.

We provide a qualitative understanding of the influence of parameterization on area by listing the parameter with the strongest correlation using big-O notation in Table IV.

To outline the accuracy of our modeling approach, we show the area scaling of four of our iDMAEs for different protocol configurations, depending on the three main parameters, starting from the base configuration. The subplots of Fig. 12 present the change in area when one of the three main parameters is modified, together with the combined output of our two linear models. The combined area model tracks the parameter-dependent area development with an average error of less than 9 %. In those cases where the model deviates, the modeled area is overestimated, providing a safe upper bound for the back-end area.
containing the number of ports of each protocol as an input. This B. Timing Model
set of models allows us to estimate the area decomposition of We again focus our timing analysis on the back-end, as the
the base hardware of the back-end and the area contributions front-end should be analyzed in-system and mid-ends may be
of any additional protocol port, given a particular parameter- isolated from the iDMAE’s timing by cutting timing paths
ization, with an average error of less than 4 %. For example, between front-, mid-, and back-ends. Our investigation shows
Table IV shows the modeled area decomposition for our base a multiplicative inverse dependency between the longest path
configuration of 32-b address width, 32-b data width, and two in ns and our main parameters. We use the base configuration
outstanding transfers. of the back-end to evaluate our timing model by sweeping
The tracking of our model is presented in Fig. 13 for six representative configurations ranging from simple OBI to complex multi-protocol configurations involving AXI. Our timing model achieves an average error of less than 4 %.

The results divide our back-ends into two groups: simpler protocols, OBI and AXI Lite, run faster as they require less complex legalization logic, whereas more complex protocols require deeper logic and thus run slower. Engines supporting multiple protocols and ports also run slower due to additional arbitration logic in their data path. Data width has a powerful impact on iDMAE speed, mainly due to the wider shifters required to align the data. The additional slowdown at larger data widths can be explained by physical routing and placement congestion of the increasingly large buffer in the dataflow element. Address width has little effect, as the critical path does not pass through the legalizer cores, whose timing is most notably affected by address width. Increasing the number of outstanding transactions sub-linearly degrades timing due to the more complex FIFO management logic required to orchestrate them.

Fig. 12. Area scaling of a back-end base configuration (32-b address and data width, two outstanding transactions). Markers represent the measurement points and lines the fitted model.

Fig. 13. Clock frequency scaling of a back-end base configuration (32-b address and data width, two outstanding transactions).

C. Latency

Our iDMA back-ends have a fixed latency of two cycles from receiving a 1D transfer from the front-end or the last mid-end to the read request at a protocol port. Notably, this is independent of the protocol selection, the number of protocol ports, and the three main iDMA parameters. This rule only has one exception: in a back-end without hardware legalization support, the latency reduces to one cycle. Generally, each mid-end presented in Table II requires one additional cycle of latency. We note, however, that the tensor_ND mid-end can be configured to have zero cycles of latency, meaning that even for an N-dimensional transfer, we can ensure that the first read request is issued two cycles after the transfer arrives at the mid-end from the front-end.

D. Standalone Performance

We evaluated the out-of-context performance of an iDMAE in the base configuration copying a 64 KiB transfer, fragmented into individual transfers of sizes between 1 B and 1 KiB, in three different memory system models. The analysis is protocol-agnostic as all implemented protocols support a similar outstanding-transaction mechanism; we thus use AXI4 in this subsection without loss of generality. The three memory systems used in our evaluation differ in access cycle latency and number of outstanding transfers.

SRAM represents the L2 memory found in the PULP-open system (Section III-A), with three cycles of latency and eight outstanding transfers. RPC-DRAM uses the characteristics of an open-source AXI4 controller for the RPC DRAM technology [28], [29] run at 933 MHz, with around thirteen cycles of latency and support for sixteen outstanding transactions. HBM models an industry-grade HBM [14] interface with a latency in the order of 100 cycles and support for tracking more than 64 outstanding transfers.
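The number of transfers that must be in flight to keep the bus busy follows the usual latency-bandwidth product reasoning. The helper below is a back-of-the-envelope sketch of that calculation, not a model of the engine itself; the example figures in the comment only restate the memory-system parameters above.

/* Back-of-the-envelope estimate of how many transfers must be outstanding
 * to hide an endpoint's access latency: while one transfer of `beats` bus
 * beats is being received, enough further requests must already have been
 * issued to cover `latency_cycles` of round-trip delay. */
unsigned required_outstanding(unsigned latency_cycles,
                              unsigned beats_per_transfer) {
    /* ceil(latency / beats) requests in flight, plus the one currently
     * completing, keep the data bus busy every cycle. */
    return 1u + (latency_cycles + beats_per_transfer - 1u) / beats_per_transfer;
}

/* Examples with the three memory systems above (single bus-width transfers
 * correspond to one beat):
 *   SRAM     (3 cycles):    required_outstanding(3, 1)   -> 4
 *   RPC-DRAM (~13 cycles):  required_outstanding(13, 1)  -> 14
 *   HBM      (~100 cycles): required_outstanding(100, 4) -> 26
 *                           for transfers of four times the bus width        */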
In shallow memory systems, the iDMAE reaches almost perfect bus utilization when copying single bus-sized data transfers while tracking as few as eight outstanding transactions. More outstanding requests are required in deeper memory systems to sustain this perfect utilization. Any transfers smaller than the bus width will inevitably lead to a substantial drop in utilization, meaning that unaligned transfers inherently limit the maximum possible bus utilization our engines can achieve. Nevertheless, our fully decoupled, data-flow-oriented architecture maximizes the utilization of the bus even in these scenarios.

Fig. 14 shows that even in very deep systems with hundreds of cycles of latency, our engine can achieve almost perfect utilization for a relatively small transfer granularity of four times the bus width. This agility in handling transfers allows us to copy multi-dimensional tensors with a narrow inner dimension efficiently.

The cost of supporting such fine-granular transfers is an increased architectural size of the engines' decoupling buffers. As shown in Fig. 12(c), these scale linearly in the number of outstanding transactions to be supported, growing by roughly 400 GE for each added buffer stage. In our base
TABLE V
COMPARISON OF IDMA TO THE SOA

Each row lists: application; technology; supported protocols; transfer type; programming model; stream modification capability; modularity and configurability; area.

Su et al. [40]: general-purpose; tech.-independent; AXI4; linear; register file; none; no; N.A.
Nandan et al. (VDMA) [41]: video; N.A.; custom; 2D, arb. strides; integrated; none; no; N.A.
Comisky et al. [42]: MCU; N.A.; TR bus; linear; register file; none; no; N.A.
This work, Manticore: HPC FP workloads; ASIC; AXI4, OBI; 2D, arb. strides, scatter-gather; RISC-V ISA ext.; memory init.; configurable and modular; ≈ 75 kGE
This work, MemPool: image processing; ASIC; AXI4, OBI; linear; register file; none; configurable and modular; ≈ 45 kGE
This work, PULP-open: ULP ML; ASIC, FPGA; AXI4, OBI; 3D, arb. strides, scatter-gather; register file; block transposition; configurable and modular; ≈ 50 kGE
This work, Cheshire: application-grade; ASIC; AXI4; linear; transfer descriptors; none; configurable and modular; ≈ 60 kGE
This work, ControlPULP: power management MCU; tech.-independent; AXI4, OBI; 3D, arb. strides, scatter-gather; register file; none; configurable and modular; ≈ 61 kGE
This work, IO-DMA: ULP MCU; tech.-independent; OBI; linear; register file; none; configurable and modular; ≈ 2 kGE
transfer descriptor chaining [1], [26]. Our tensor_ND mid-end can execute arbitrary high-dimensional transfers in hardware. Our desc_64 front-end supports descriptor chaining to handle arbitrarily shaped transfers without putting any load on the PE. Our flexible architecture can easily accelerate special transfer patterns required by a specific application: once programmed, our novel rt_3D mid-end autonomously fetches strided sensor data without involving any PE. Rhu et al. [32] and Ma et al. [1] present DMAEs able to modify the data while it is copied. Although their work proposes a solution for their respective application space, neither of these engines presents a standardized interface to exchange stream acceleration modules between platforms easily. Additionally, both engines, especially MT-DMA, impose substantial area overhead, limiting their applicability in ULP designs. Our engines feature a well-defined interface accessing the byte stream while data is copied, allowing us to include existing accelerators. Furthermore, we provide a novel, ultra-lightweight memory initialization feature, typically requiring less than 100 GE.

FastVDMA [33] shows basic modularity by allowing users to select one read and one write protocol from a list of three on-chip protocols. Its modularity is thus limited to the back-end, and there is neither any facility to change between different programming interfaces nor a way of easily adding more complex multi-dimensional affine stream movement support. As presented in this work, our approach tackles these limitations by specifying and implementing the first fully modular, parametric, universal DMAE architecture.

VI. CONCLUSION

We present iDMA, a modular, highly parametric DMAE architecture composed of three parts (front-end, mid-end, and back-end), allowing our engines to be customized to suit a wide range of systems, platforms, and applications. We showcase its adaptability and real-life benefits by integrating it into five systems targeting various applications. In a minimal Linux-capable SoC, our architecture increases bus utilization by up to 6× while reducing the DMAE resource footprint by more than 10 %. In a ULP edge-node system, we improve MobileNetV1 inference performance from 7.9 MAC/cycle to 8.3 MAC/cycle while reducing the compute cluster area by 10 % compared to the baseline MCHAN [11]. We demonstrate iDMA's applicability in real-time systems by introducing a small real-time mid-end requiring only 11 kGE while completely liberating the core from any periodic sensor polling.

We demonstrate speedups of up to 8.4× and 15.8× in high-performance manycore systems designed for floating-point and integer workloads, respectively, compared to their baselines without DMAEs. We evaluate the area, timing, latency, and performance of iDMA, resulting in area and timing models that allow us to estimate the synthesized area and timing characteristics of any parameterization within 9 % of the actual result.

Our architecture enables the creation of both ultra-small iDMAEs incurring less than 2 kGE, as well as large high-performance iDMAEs running at over 1 GHz on a 12 nm node. Our back-ends incur only two cycles of latency from accepting an ND transfer descriptor to the first read request being issued on the engine's protocol port. They show high agility, even in ultra-deep memory systems. Flexibility and parameterization allow us to create configurations that achieve asymptotically full bus utilization and can fully hide latency in arbitrarily deep memory systems while incurring less than 400 GE per trackable outstanding transfer. In a 32-b system, our iDMAEs achieve almost perfect bus utilization for 16 B-long transfers when accessing an endpoint with 100 cycles of latency. The synthesizable RTL description of iDMA is available free and open-source.

ACKNOWLEDGMENTS

The authors thank Fabian Schuiki, Florian Zaruba, Kevin Schaerer, Axel Vanoni, Tobias Senti, and Michele Raeber for their valuable contributions to the research project.

REFERENCES

[1] S. Ma, Y. Lei, L. Huang, and Z. Wang, "MT-DMA: A DMA controller supporting efficient matrix transposition for digital signal processing," IEEE Access, vol. 7, pp. 5808–5818, 2019.
[2] J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky, "NVIDIA A100 tensor core GPU: Performance and innovation," IEEE Micro, vol. 41, no. 2, pp. 29–35, Mar.–Apr. 2021.
[3] S. Lee, S. Hwang, M. J. Kim, J. Choi, and J. H. Ahn, "Future scaling of memory hierarchy for tensor cores and eliminating redundant shared memory traffic using inter-warp multicasting," IEEE Trans. Comput., vol. 71, no. 12, pp. 3115–3126, 2022.
[4] D. Blythe, "XeHPC Ponte Vecchio," in Proc. IEEE Hot Chips 33 Symp. (HCS), Palo Alto, CA, USA: IEEE Computer Society, 2021, pp. 1–34.
[5] X. Wang, A. Tumeo, J. D. Leidel, J. Li, and Y. Chen, "HAM: Hotspot-aware manager for improving communications with 3D-stacked memory," IEEE Trans. Comput., vol. 70, no. 6, pp. 833–848, Jun. 2021.
[6] R. Branco and B. Lee, "Cache-related hardware capabilities and their impact on information security," ACM Comput. Surv., vol. 55, no. 6, pp. 1–35, Article 125, Jun. 2023.
[7] V. Nagarajan, D. J. Sorin, M. D. Hill, and D. A. Wood, A Primer on Memory Consistency and Cache Coherence (Synthesis Lectures on Computer Architecture), vol. 15. San Rafael, CA, USA: Morgan & Claypool Publishers, 2020, pp. 1–294.
[8] A. Jain and C. Lin, Cache Replacement Policies (Synthesis Lectures on Computer Architecture), vol. 14. San Rafael, CA, USA: Morgan & Claypool Publishers, 2019, pp. 1–87.
[9] R. Balasubramonian, N. P. Jouppi, and N. Muralimanohar, Multi-Core Cache Hierarchies (Synthesis Lectures on Computer Architecture), vol. 6. 2011, pp. 1–153. [Online]. Available: https://api.semanticscholar.org/CorpusID:42059734
[10] A. Ottaviano, T. Benz, P. Scheffler, and L. Benini, "Cheshire: A lightweight, Linux-capable RISC-V host platform for domain-specific accelerator plug-in," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 70, no. 10, pp. 3777–3781, Oct. 2023.
[11] D. Rossi, I. Loi, G. Haugou, and L. Benini, "Ultra-low-latency lightweight DMA for tightly coupled multi-core clusters," in Proc. 11th ACM Conf. Comput. Frontiers, 2014, pp. 1–10.
[12] A. Ottaviano et al., "ControlPULP: A RISC-V on-chip parallel power controller for many-core HPC processors with FPGA-based hardware-in-the-loop power and thermal emulation," 2023, arXiv:2306.09501.
[13] S. Riedel, M. Cavalcante, R. Andri, and L. Benini, "MemPool: A scalable manycore architecture with a low-latency shared L1 memory," IEEE Trans. Comput., early access, 2023.
[14] F. Zaruba, F. Schuiki, and L. Benini, "A 4096-core RISC-V chiplet architecture for ultra-efficient floating-point computing," in Proc. IEEE Hot Chips 32 Symp. (HCS), Palo Alto, CA, USA, 2020, pp. 1–24.
[15] "AMBA AXI and ACE protocol specification: AXI3, AXI4, and AXI4-Lite, ACE and ACE-Lite." Arm Developer. Accessed: Sep. 6, 2022. [Online]. Available: https://developer.arm.com/documentation/ihi0022/hc/?lang=en
[16] "OBI 1," Silicon Labs, Inc., version 1.5.0, 2020. [Online]. Available: https://github.com/openhwgroup/programs/blob/master/TGs/cores-task-group/obi/OBI-v1.5.0.pdf
[17] "DesignWare IP solutions for AMBA - AXI DMA controller." Synopsys. Accessed: Sep. 6, 2022. [Online]. Available: https://www.synopsys.com/dw/ipdir.php?ds=amba_axi_dma
[18] F. Zaruba, F. Schuiki, T. Hoefler, and L. Benini, "Snitch: A tiny pseudo dual-issue processor for area and energy efficient execution of floating-point intensive workloads," IEEE Trans. Comput., vol. 70, no. 11, pp. 1845–1860, Nov. 2021.
[19] "AXI-Stream protocol specification." Arm Developer. Accessed: Sep. 6, 2022. [Online]. Available: https://developer.arm.com/documentation/ihi0051/b/?lang=en
[20] "SiFive TileLink specification." SiFive. Accessed: Sep. 6, 2022. [Online]. Available: https://starfivetech.com/uploads/tilelink_spec_1.8.1.pdf
[21] A. Pullini, D. Rossi, I. Loi, G. Tagliavini, and L. Benini, "Mr.Wolf: An energy-precision scalable parallel ultra low power SoC for IoT edge processing," IEEE J. Solid-State Circuits, vol. 54, no. 7, pp. 1970–1981, Jul. 2019.
[22] "HyperBus specification," Cypress Semiconductor, 2019. [Online]. Available: https://www.infineon.com/dgdl/Infineon-HYPERBUS_SPECIFICATION_LOW_SIGNAL_COUNT_HIGH_PERFORMANCE_DDR_BUS-AdditionalTechnicalInformation-v09_00-EN.pdf?fileId=8ac78c8c7d0d8da4017d0ed619b05663
[23] A. Burrello, A. Garofalo, N. Bruschi, G. Tagliavini, D. Rossi, and F. Conti, "DORY: Automatic end-to-end deployment of real-world DNNs on low-cost IoT MCUs," IEEE Trans. Comput., vol. 70, no. 8, pp. 1253–1268, Aug. 2021.
[24] T. Rosedahl, M. Broyles, C. Lefurgy, B. Christensen, and W. Feng, "Power/performance controlling techniques in OpenPOWER," in Proc. High Perform. Comput., Cham, Switzerland: Springer, 2017, pp. 275–289.
[25] I. Ripoll and R. Ballester-Ripoll, "Period selection for minimal hyperperiod in periodic task systems," IEEE Trans. Comput., vol. 62, no. 9, pp. 1813–1822, Sep. 2013.
[26] "AXI DMA v7.1 LogiCORE IP product guide," AMD Xilinx, 2022. [Online]. Available: https://docs.xilinx.com/r/en-US/pg021_axi_dma
[27] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1–25, Article 1, Nov. 2011.
[28] "256Mb high bandwidth RPC DRAM," Etron Technology, Inc., revision 1.0, 2019. [Online]. Available: https://etronamerica.com/wp-content/uploads/2019/05/EM6GA16LGDABMACAEA-RPC-DRAM_Rev.-1.0.pdf
[29] "Reduced pin count (RPC®) DRAM," Etron Technology, Inc., v20052605.0, 2022. [Online]. Available: https://etron.com/wp-content/uploads/2022/06/RPC-DRAM-Overview-Flyer-v20202605.pdf
[30] J. Fjeldtvedt and M. Orlandić, "CubeDMA – Optimizing three-dimensional DMA transfers for hyperspectral imaging applications," Microprocessors Microsystems, vol. 65, pp. 23–36, Mar. 2019.
[31] K. Paraskevas et al., "Virtualized multi-channel RDMA with software-defined scheduling," Procedia Comput. Sci., vol. 136, pp. 82–90, 2018.
[32] M. Rhu, M. O'Connor, N. Chatterjee, J. Pool, Y. Kwon, and S. W. Keckler, "Compressing DMA engine: Leveraging activation sparsity for training deep neural networks," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), 2018, pp. 78–91.
[33] "Antmicro releases FastVDMA open-source resource-light DMA controller." AB Open. Accessed: Feb. 7, 2023. [Online]. Available: https://abopen.com/news/antmicro-releases-fastvdma-open-source-resource-light-dma-controller/
[34] "CoreLink DMA-330 DMA controller." Arm Developer. Accessed: Dec. 18, 2022. [Online]. Available: https://developer.arm.com/documentation/ddi0424/d/?lang=en
[35] "DMA AXI IP controller." Rambus. Accessed: Dec. 18, 2022. [Online]. Available: https://www.plda.com/products/vdma-axi
[36] "Dmaengine overview – STM32MPU." STMicroelectronics. Accessed: Dec. 18, 2022. [Online]. Available: https://wiki.st.com/stm32mpu/wiki/Dmaengine_overview
[37] "DDMA, multi-channel DMA controller IP core from DCD-SEMI." Design and Reuse. Accessed: Dec. 18, 2022. [Online]. Available: https://www.design-reuse.com/news/53210/multi-channel-dma-controller-ip-core-dcd-semi.html
[38] A. Pullini, D. Rossi, G. Haugou, and L. Benini, "uDMA: An autonomous I/O subsystem for IoT end-nodes," in Proc. 27th Int. Symp. Power Timing Model. Optim. Simul. (PATMOS), 2017, pp. 1–8.
[39] H. Morales, C. Duran, and E. Roa, "A low-area direct memory access controller architecture for a RISC-V based low-power microcontroller," in Proc. IEEE 10th Latin Amer. Symp. Circuits Syst. (LASCAS), 2019, pp. 97–100.
[40] W. Su, L. Wang, M. Su, and S. Liu, "A processor-DMA-based memory copy hardware accelerator," in Proc. IEEE 6th Int. Conf. Netw. Archit. Storage, Jul. 2011, pp. 225–229.
[41] N. Nandan, "High performance DMA controller for ultra HDTV video codecs," in Proc. IEEE Int. Conf. Consum. Electron. (ICCE), Jan. 2014, pp. 65–66.
[42] D. Comisky, S. Agarwala, and C. Fuoco, "A scalable high-performance DMA architecture for DSP applications," in Proc. Int. Conf. Comput. Des., Sep. 2000, pp. 414–419.

Thomas Benz (Graduate Student Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical engineering and information technology from ETH Zurich, in 2018 and 2020, respectively. He is currently working toward the Ph.D. degree with the Digital Circuits and Systems group of Prof. Benini. His research interests include energy-efficient high-performance computer architectures and the design of ASICs.

Michael Rogenmoser (Graduate Student Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical engineering and information technology from ETH Zurich, in 2020 and 2021, respectively. He is working toward the Ph.D. degree with the Digital Circuits and Systems group of Prof. Benini. His research interests include fault-tolerant processing architectures and multicore heterogeneous SoCs for space.

Paul Scheffler (Graduate Student Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical engineering and information technology from ETH Zurich, in 2018 and 2020, respectively. He is currently working toward the Ph.D. degree with the Digital Circuits and Systems group of Prof. Benini. His research interests include acceleration of sparse and irregular workloads, on-chip interconnects, manycore architectures, and high-performance computing.

Samuel Riedel (Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical engineering and information technology from ETH Zurich, in 2017 and 2019, respectively. He is currently working toward the Ph.D. degree with the Digital Circuits and Systems group of Prof. Benini. His research interests include computer architecture, focusing on manycore systems and their programming model.

Alessandro Ottaviano (Graduate Student Member, IEEE) received the B.Sc. degree in physical engineering from Politecnico di Torino, Italy, and the M.Sc. degree in electrical engineering as a joint degree between Politecnico di Torino, Grenoble INP - Phelma, and EPFL Lausanne, in 2018 and 2020, respectively. He is currently working toward the Ph.D. degree with the Digital Circuits and Systems group of Prof. Benini. His research interests include real-time and predictable computing systems and energy-efficient processor architecture.
Andreas Kurth (Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical engineering and information technology from ETH Zurich, in 2014 and 2017, respectively, and the Ph.D. degree with the Digital Circuits and Systems group of Prof. Benini, in 2022. His research interests include the architecture and programming of heterogeneous SoCs and accelerator-rich computing systems.

Luca Benini (Fellow, IEEE) holds the Chair of Digital Circuits and Systems with ETH Zurich and is a Full Professor with the Università di Bologna. His research interests include energy-efficient computing systems design, from embedded to high-performance. He has published more than 1000 peer-reviewed papers and five books. He is a fellow of the ACM and a member of Academia Europaea.