
RNN-Based Radio Resource Management on Multicore RISC-V Accelerator Architectures

Gianna Paulin, Student Member, IEEE, Renzo Andri, Member, IEEE, Francesco Conti, Member, IEEE, and Luca Benini, Fellow, IEEE

Abstract— Radio resource management (RRM) is critical in mobile communications due to its ubiquity on every radio device and its low latency constraints. The rapidly evolving RRM algorithms with low latency requirements, combined with the dense and massive 5G base station deployment, ask for an on-the-edge RRM acceleration system with a tradeoff between flexibility, efficiency, and cost, making application-specific instruction-set processors (ASIPs) an optimal choice. In this work, we start from a baseline, simple RISC-V core and introduce instruction extensions coupled with software optimizations for maximizing the throughput of a selected set of recently proposed RRM algorithms based on models using multilayer perceptrons (MLPs) and recurrent neural networks (RNNs). Furthermore, we scale from a single-ASIP to a multi-ASIP acceleration system to further improve RRM throughput. For the single-ASIP system, we demonstrate an energy efficiency of 218 GMAC/s/W and a throughput of 566 MMAC/s, corresponding to an improvement of 10× and 10.6×, respectively, over the single-core system with a baseline RV32IMC core. For the multi-ASIP system, we analyze the parallel speedup dependency on the input and output feature map (FM) size for fully connected and LSTM layers, achieving up to 10.2× speedup with 16 cores over a single extended RI5CY core for single LSTM layers and a speedup of 13.8× for a single fully connected layer. On the full RRM benchmark suite, we achieve an average overall speedup of 16.4×, 25.2×, 31.9×, and 38.8× on two, four, eight, and 16 cores, respectively, compared to our single-core RV32IMC baseline implementation.

Index Terms— Application-specific instruction-set processor (ASIP), long short-term memory (LSTM), machine learning, neural networks, radio resource management (RRM), recurrent neural network (RNN), RISC-V.

Manuscript received February 24, 2021; revised May 15, 2021; accepted June 6, 2021. This work was supported by Huawei Technologies Sweden AB. (Corresponding author: Gianna Paulin.)
Gianna Paulin is with the Integrated System Laboratory, ETH Zürich, 8092 Zürich, Switzerland (e-mail: [email protected]).
Renzo Andri is with the Integrated System Laboratory, ETH Zürich, 8092 Zürich, Switzerland, and also with Huawei Technologies, Zurich Research Center, 8050 Zürich, Switzerland (e-mail: [email protected]).
Francesco Conti is with the Department of Electrical, Electronic and Information Engineering, University of Bologna, 40136 Bologna, Italy (e-mail: [email protected]).
Luca Benini is with the Integrated System Laboratory, ETH Zürich, 8092 Zürich, Switzerland, and also with the Department of Electrical, Electronic and Information Engineering, University of Bologna, 40136 Bologna, Italy (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2021.3093242.
Digital Object Identifier 10.1109/TVLSI.2021.3093242

I. INTRODUCTION

PEOPLE'S demand for being mobile and continuously connected with reliable high-speed internet (e.g., for video streaming) and the increasing number of connected Internet-of-Things (IoT) devices make the new 5G radio communication standard a necessity for the advancement of the digital revolution. However, these advancements heavily tighten the already demanding requirements on the hardware (HW) and software (SW) layers used in radio communication, pushing industry and academia toward improving the efficiency of radio resource management (RRM) [1].

RRM typically runs on a radio access network (RAN) System-on-Chip (SoC) on every base station. The RAN optimizes various tasks, such as, e.g., limited radio frequency communication spectrum utilization, transmission power control, error coding, and beamforming, within a very short time period. While doing so, various constraints need to be considered: the communication load needs to be appropriately balanced, every user device needs to be served fairly, the overall throughput should be high, and ideally, everything should be performed with high energy efficiency. Fig. 1 gives a high-level overview of some essential RRM tasks with the most common performance metrics [2]. Traditionally used optimization algorithms for RRM include exhaustive heuristic search methods, iterative algorithms [3], [4], nonlinear nonconvex optimization problems [3], game theory, or Lagrangian relaxations [5]. Recently, however, algorithms based on deep learning (DL) have revolutionized many time-series analysis problems, such as speech recognition [6], speech synthesis [7], automatic translation [8], and biosignal analysis [9]. Therefore, it is no surprise that research has started to tackle RRM using neural networks [10]–[20]. Compared to the aforementioned traditional iterative algorithms, DL-based models are capable of autonomously extracting high-level features and (nonlinear) correlations with a much smaller computational cost, making their real-time implementation in the time range of milliseconds much easier [10].

Up until now, RRM tasks addressed by DL models are at the data-link layer of the OSI model [21]. Recurrent neural networks (RNNs), often the long short-term memory (LSTM) version, are successful at carrier sensing and collision detection and have the capability to learn and compensate nonlinearities and imperfections that are invariant to the environment in RF components at runtime [17]. Dynamic resource scheduling of frequency bands, dynamic range control, and various other network optimizations are successfully addressed by convolutional neural networks (CNNs) and LSTM networks [11], [12]. Multilayer perceptrons (MLPs) are used for optimal resource allocation [10], [13], dynamic transmit power control [14], dynamic multichannel access [15], beamformer design [10], dynamic spectrum allocation, transmission power control [16], and channel access optimizations [19].

Fig. 1. Overview of various RRM tasks with their most common optimization constraints and metrics [2]. The resource categories are high-level and can
include many fine-grained tasks, often also targeting metrics from other task categories.

Fig. 2. Overview of benefits and drawbacks of various HW platforms for RRM. "+" corresponds to "high" and "−" to "low."

RRM tasks are highly latency-critical and are ideally performed in situ at the base stations. However, since RRM models and algorithms are typically rapidly evolving, while the costs of the base stations should be amortized over a long time period, specialized hardwired accelerators cannot provide enough flexibility. Using FPGA fabrics for RRM acceleration would bring the needed flexibility; however, FPGAs are expensive when targeting massive and dense deployment as required in 5G networks and are hard to integrate into bigger Systems-on-Chips (SoCs). In 5G network deployments, an approach based on application-specific instruction-set processors (ASIPs) achieves an excellent tradeoff between efficiency, costs, and flexibility. Furthermore, integrating an ASIP-based subsystem in complex 5G SoCs is a very widely practiced approach. In addition, the various high-quality open-source cores based on the open and royalty-free RISC-V ISA offer a widely supported standardized baseline to systematically explore the architectural needs of DL-based RRM applications. An overview of the benefits and drawbacks of various HW platforms is shown in Fig. 2.

This work is an extension of Andri et al. [22]. We evaluate the architectural needs of DL-based RRM applications on an RV32IMC RI5CY open-source core [23], introducing further extensions and support for parallelization. We perform all our evaluations on open-source single-core and multicore architectures based on the same core [24]. We make the following contributions.¹

1) We define a benchmark suite with multiple MLP- and RNN-based RRM applications running on a single-core cluster configuration with RI5CY and apply various software optimizations: xPULP extensions (4.0×), improved data reuse through output feature map (OFM) tiling (1.7×), and input feature map (IFM) tiling (3%).
2) We extend RI5CY with RRM-specific hardware instructions: custom activation instructions (13% within LSTMs) and a merged load and compute instruction (1.5×), with minimal area overhead (3.4%) and no increase in the critical path.
3) On a single-core configuration, our instruction extensions achieve an energy efficiency of 218 GMAC/s/W and a throughput of 566 MMAC/s in 22FDX technology at 0.8 V, corresponding to an improvement of 10× and 10.6×, respectively, over a baseline RISC-V IMC core. Furthermore, we define a multicore RRM cluster configuration, showing that parallel speedups depend on the input and output FM size for fully connected layers (FCLs) and LSTM layers. We achieve up to 10.2× speedup with 16 cores over a single extended RI5CY core for single LSTM layers and up to 13.8× for single FCLs.
4) We provide a detailed analysis of parallelizing the RRM benchmark suite on a range of cluster configurations, taking into account speedups, parallelization overhead (average of only 2.55% on bigger RRM models), and memory transfers. We achieve a total overall speedup of 193.7×, 110.4×, and 132.0× on 16 cores and 101.9×, 81.5×, and 93.4× on eight cores compared to the single-core RV32IMC baseline implementation when the complete models fit into 512 kB of L1 memory.

This article is structured as follows. Section II describes the related work and the selected benchmark suite. Section III describes the system architecture and all microarchitectural and software optimizations. In Section IV, we define an upper bound for the speedup on the multicore system. Section V discusses experiments on the single-core and multicore systems.

¹Hardware, software, and benchmarks have been open-sourced on GitHub: https://github.com/iis-eth-zurich/RNNASIP

II. RELATED WORK

A. RNN for 5G

The expected benefits of 5G require a frequency band utilization in the range of 3–300 GHz, which goes beyond the previously utilized 300-MHz–3-GHz spectrum. Using these higher frequency bands incurs higher path losses, reflections, and scattering, requiring an ultradense base station deployment. In addition, 5G will be used in combination with other radio access technologies, such as 2G, 3G, and LTE-A, resulting in higher interference. With the tightened device power consumption requirements and other equally essential requirements, such as fairness or load balancing, RRM for 5G becomes extremely complex [2].

Therefore, finding more efficient RRM algorithms is crucial for the successful deployment of 5G technology. With the recent DL revolution, it comes as no big surprise that industry and academia have started to tackle RRM with neural networks, which are now considered the SoA approach.


TABLE I
BENCHMARK SUITE OF TYPICAL MODELS USED FOR RRM: LSTM RNNS AND MLPS THAT ARE BUILT FROM MULTIPLE FCLS

The used DL models include deep MLPs [10], [13], CNNs, and LSTM RNNs [11], [12]. While attention-based models, such as the transformer [25], have shown state-of-the-art results for all kinds of time-series predictions, to the best of our knowledge, this new network type has not yet been applied to RRM tasks.

In this work, we have selected a comprehensive set of DL-based RRM models as a benchmark suite (see Table I). All models are used in a deep reinforcement learning (DL-RL) setup: we have an agent which interacts with an environment. At every timestep, the agent decides, based on an observed state of the environment and according to a policy, what action to take, thereby putting the environment in a new state. Based on the new state, the agent receives feedback. The policy is typically implemented as a function approximation and, in our case, is implemented with MLP and/or LSTM models. Note that, in general, RL allows policy updates to be performed online based on the received feedback. However, none of the selected benchmarks makes use of this feature. Instead, the models are trained offline and are deployed on a base station (typically on a RAN SoC). Online training support increases the computational complexity from O(n_layers · n³) for the inference of an MLP with N_I = N_O = n to O(n_gradient iterations · n_layers · n³). While this can get problematic for the highly latency-critical RRM applications, offline training still allows updating the models periodically as needed (e.g., once per week) [12].

Furthermore, DNN-based policies are trained with algorithms such as gradient descent and backpropagation, which, in contrast to the inference part of DNNs, typically still require high-precision floating-point arithmetic. In contrast, inference works well with low-precision fixed-point arithmetic, such as 16-bit [26], 8-bit, or even fewer bits, while keeping accuracy high [27]. As RRM mainly uses MLP and LSTM layers [28], the latter being well known for quite high sensitivity to numerical precision, we use for our implementation the rather conservative 16-bit integer format, which requires only simple or even no quantization-aware training methods.

Models A and B both combine LSTMs and MLPs. Model A maximizes throughput by adapting dynamic channel selection, carrier aggregation, and spectrum access under fairness constraints [11]; Model B focuses more on dynamic spectrum access for network utility maximization [12]. Model C is also an MLP and minimizes interference under latency constraints via channel selections and power control [16]. Model D targets a multichannel access problem for throughput maximization [20]. Models E and F tackle the problem of interference channel power control and throughput maximization [10], [14]. Model G optimizes the sum throughput and α-fairness in time slot sharing among heterogeneous network nodes with different applied MAC protocols [19]. Model H optimizes throughput with a more general optimal resource allocation problem formulation. Finally, model I is the only CNN-based model. It aims at maximizing throughput by power control [15]. For more detailed information about the individual benchmarks, we refer the interested reader to the references of the corresponding models.

Currently, these RRM tasks are mainly executed on general-purpose processors on the 5G base stations [29]. The application-specific customization of these processors can cover the various needs of the RNN-based applications while still being flexible enough for adapting to rapidly evolving algorithms. In this work, we present the first, to the best of our knowledge, dedicated ASIP-based subsystem for running RNN-based RRM efficiently.

Up until now, all RAN SoCs are heavily proprietary and closed-source. These designs tend to be offered by a very limited number of vendors and typically require a complete replacement when upgrading to newer protocols and standards, as their HW and SW designs are heavily coupled. The virtualization coming with OpenRAN allows operators to run software-based network functions on standard (COTS) servers, letting them upgrade software code and hardware components more gradually [30]. General-purpose hardware running software-defined stacks, however, is often suboptimal from a power viewpoint. Recently, a startup called EdgeQ [31] has announced a developer-accessible RISC-V-based SoC with custom hardware instructions to accelerate algorithms used for 4G and 5G communication and signal processing. Our own approach is similar but takes a further step toward openness, relying on: (i) an open ISA; (ii) open-source cores and compilers; and (iii) an open-source architecture [32].

B. ISA Extensions for Domain Specialization

The idea of extending general-purpose cores with application-specific instructions is not new. ARM and Intel both offer various matrix computation and vector processing extensions in high-performance oriented general-purpose processors. For example, ARM has introduced the AARCH64 Neon extensions with the ARMv8-A processor series, including SIMD instructions for sum-dot-products (e.g., BFDOT) and 2 × 2 matrix–matrix multiplications (e.g., BFMMLA) with two-way SIMD in the brain floating-point format bfloat16. Note that this processor series comes in various microarchitectures.


For example, the CORTEX-A55 (ARMv8.2-A) has an in-order superscalar microarchitecture, while, e.g., the CORTEX-A75 (ARMv8.2-A) incorporates an out-of-order superscalar pipeline. Intel's out-of-order SkyLake SP processors extend the x86 ISA with 512-bit wide vector units called AVX512, enabling 16 × 32-bit SIMD vector operations with multiplication in half-precision float (FP16) and accumulation in single-precision float (FP32). In 2019, Intel announced its new x86 microarchitecture called Cascade Lake, which introduces the AVX512 Vector Neural Network Instructions (VNNI). AVX512 VNNI implements an 8-bit and 16-bit fixed-point vector product with 32-bit internal accumulation [33]. While Intel focuses mostly on the high-performance, high-cost processor market, ARM also offers microcontrollers in the low-cost and low-power range with the Cortex-M family. Recently, ARM introduced the Cortex-M55, an ultra-low-power in-order microprocessor with the Vector Extension MVE (Helium). The Helium instructions support various single instruction–multiple data (SIMD) instructions (INT8/16/32, FP16/32), hardware loops, and interleaved postincrement load/stores [34]. The Helium extension shows that introducing custom ISA extensions is not only beneficial for high-performance general-purpose out-of-order cores and superscalar in-order cores but also for small, energy-efficient, in-order cores with an IPC close to one.

Several academic proposals also introduce specialized instructions: Neves et al. [35] introduce specialized SIMD instructions optimized for biological sequence alignment algorithms. Guan et al. [36] propose a custom fast Fourier transformation (FFT) instruction for increased throughput on orthogonal frequency-division multiplexing (OFDM)-based communication standards. Opportunities for ASIPs in 5G networks are surveyed in [37]. The authors also mention ASIPs for ML acceleration as a future direction in 5G applications. Our work takes a major step in this direction. We focus on RISC-V ISA extensions for two key reasons. First, RISC-V is designed for extensibility, with significant parts of the opcode space reserved for custom extensions. Second, with the growing community around the open and royalty-free RISC-V ISA, the number of high-quality RISC-V-based open-source cores and microcontroller systems has grown rapidly. Various open-source cores already support custom instructions, e.g., the RI5CY core from the parallel ultra-low-power (PULP) project supports custom xPULP instructions, such as SIMD, HW loops, and postincrement loads [23]. In this work, we use the open RISC-V ISA and start as a baseline with a multicore cluster system based on the mentioned open-source RI5CY core.

C. Software Optimizations

The rise of the previously described ISA extensions has led industry and academia to develop highly optimized SW kernels, such as matrix–vector and matrix–matrix multiplications, which make the best use of these extensions. The used techniques mainly include the utilization of parallel SIMD computations and data reuse within the local register file with appropriate tiling for reduced memory data loads. The latter has been commonly used for tiling the output FMs, where loaded IFMs can be reused to compute multiple outputs in parallel [38], [39]. CNNs can exploit the im2col concept of replicating and rearranging FMs to be formulated as a matrix–matrix multiplication problem [38], [39]. This well-researched reformulation enables the tiling of both the input and output FMs spatially in m×n-sized tiles, thereby enabling the reuse of both weights and input FM pixels, which ultimately reduces the number of memory loads from O(mn) to O(m + n). Since both the MLP and (nonconvolutional) LSTM layers are based on matrix–vector multiplications, this 2-D tiling cannot be reused.

In contrast to CNNs, RNNs require the use of nonlinear activation functions, such as the hyperbolic tangent and sigmoid. As their transcendental computation is computationally complex, various acceleration approaches have been proposed: (i) piecewise linear approximation (PLA) [38]; (ii) low-order Taylor series expansion (e.g., second order [40]); (iii) lookup table (LUT) with adaptive value granularity [41]; and (iv) small neural networks [42]. For our extension, we apply the PLA approach and, unlike other works, exploit the symmetry property of tanh and sig. In addition, we go a step further than, e.g., ARM's CMSIS-NN library, and evaluate the error introduced by different numbers of interpolation intervals, taking the applied fixed-point quantization into account [38].

III. ASIP-BASED RNN ACCELERATION

A. Baseline RISC-V Architecture

In this work, we propose an ASIP-based RNN RRM acceleration system. We start with designing a single-core ASIP, which we then integrate into a cluster with up to N = 16 ASIPs. As a starting point for the ASIP, we use PULP's RI5CY core, a four-stage pipeline processor supporting the RISC-V standard extended with custom instructions (i.e., the RV32IMFCXpulp ISA) [23]. We further extend the RI5CY core with specialized dot product, hyperbolic tangent, and sigmoid instructions for efficient MLP and LSTM execution [22]. We evaluate these custom instructions for throughput and energy efficiency on the selected RRM benchmark suite in an acceleration subsystem of up to 16 RNN-enhanced RI5CY cores. An overview of the proposed architecture and its integration into a state-of-the-art RAN SoC is shown in Fig. 3. In the right half, we show the proposed RRM acceleration cluster. We perform the evaluations on a cluster configuration with a single RNN ASIP, in which case the modules with a dashed border in Fig. 3 would be removed, and in a multicore configuration where all shown modules are used. For the integration of the proposed ASIP-based acceleration subsystem into a large 5G RAN SoC, we propose to connect one of the proposed systems via a crossbar, e.g., by connecting it to the system crossbar in Marvell's Octeon TX2 CN98xx architecture [29], as shown on the left-hand side of Fig. 3.

The proposed RRM cluster is designed around a configurable number (up to 16) of enhanced RI5CY cores. The cluster has no data cache hierarchy in the traditional sense. Instead, all ASIP cores share a single-cycle accessible L1 tightly coupled data memory (TCDM) with a banking factor of 2, composed of word-interleaved single-port SRAM macros as memory banks. The programmer is responsible for ensuring that


the correct data are loaded via DMA from the L2 memory before the ASIP cores try to access them. The required synchronization schemes are implemented by an event unit. We use three different cluster configurations for our evaluations, in which we have one, eight, or 16 ASIPs and a 1-MB, 512-kB, or 64-kB L1 memory. The L2 memory is mapped into the last level cache (LLC) memory of the RAN SoC.

Fig. 3. ASIP RNN subsystem overview including its integration in a typical RAN SoC. The arrows in the RRM cluster point in the master-to-slave direction. The modules with a dashed border are only used in the multicore cluster configuration, while the modules with solid lines are used in both the single-core and multicore cluster configurations.

A DMA engine is connected to the L1 memory via the logarithmic interconnect to manage data transfers between L2 and the small L1 embedded within the RRM cluster. The DMA can be controlled by a single core from the cluster side and supports blocking and nonblocking 64-bit/cycle data transactions in both directions concurrently. The ASIPs fetch their instructions from a shared instruction cache. Multiport memory banks are used for the shared tag and instruction memory. The I-Cache has access to the 64-bit AXI cluster bus to fetch off-cluster data from the L2 in the case of a cache miss. The 64-bit AXI cluster bus also serves DMA data transfers from L2 to L1 TCDM. Over a so-called peripheral interconnect, the ASIPs can control and program the DMA engine, the event unit, or further peripherals, e.g., timers.

B. Neural RRM Models

The selected set of benchmarks is based on two DL models: MLPs and LSTM RNNs. MLPs include at least three layers: one input, at least one hidden, and one output layer. These individual layers, also called fully connected layers, are biased matrix–vector multiplications transforming N_X input features x_t at time t into N_O output activations y_t with the weight W_x ∈ R^(N_O × N_X) and bias b_y ∈ R^(N_O)

    y_t = W_x x_t + b_y.    (1)

RNNs [43] are a linear superposition of multiple FCLs activated by a nonlinear activation function act (typically the hyperbolic tangent or sigmoid function) with a feedback over the hidden state h_t = (h_1, h_2, . . . , h_N_H) with N_H elements

    h_t = act(W_xh x_t + W_hh h_{t-1} + b_h).    (2)

RNN networks can contain multiple RNN layers by feeding the hidden state of one RNN layer as the input state to the next RNN layer. The final output of the network is computed from the hidden state of the last RNN layer

    y_t = act(W_hy h_t + b_y).    (3)

LSTM neural networks [44] are a subclass of RNNs specialized in learning both short- and long-term time-series dependencies. Besides the hidden state h_t, LSTMs include an internal cell state c = (c_1, c_2, . . . , c_N_H). Their computation includes matrix–vector multiplications, pointwise vector–vector additions/multiplications, and pointwise applied sigmoid and hyperbolic tangent activation functions

    i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)    (4)
    f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)    (5)
    c̃_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)    (6)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t    (7)
    o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)    (8)
    h_t = o_t ⊙ tanh(c_t)    (9)

including the input gate i, the forget gate f, the output gate o, and the cell state c.
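To make the data flow of (4)–(9) concrete, the following plain-float reference of a single LSTM time step is a minimal sketch under our own naming conventions; it is not the optimized 16-bit fixed-point kernel that runs on the ASIP.

#include <math.h>

/* Plain-float reference of one LSTM time step following (4)-(9).
 * x_t has NI elements; h and c have NH elements. Weight matrices are stored
 * row-major: Wx* is [NH][NI], Wh* is [NH][NH]. h_prev and h_out must not
 * alias, because every output element reads the full h_{t-1} vector. The cell
 * state c is updated in place, as c_{t-1} is only used pointwise. */
static float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }

void lstm_step(int NI, int NH, const float *x,
               const float *h_prev, float *h_out, float *c,
               const float *Wxi, const float *Whi, const float *bi,
               const float *Wxf, const float *Whf, const float *bf,
               const float *Wxc, const float *Whc, const float *bc,
               const float *Wxo, const float *Who, const float *bo)
{
    for (int o = 0; o < NH; o++) {
        /* Four biased matrix-vector products reuse the same activations x_t and h_{t-1}. */
        float ai = bi[o], af = bf[o], ac = bc[o], ao = bo[o];
        for (int k = 0; k < NI; k++) {
            ai += Wxi[o * NI + k] * x[k];
            af += Wxf[o * NI + k] * x[k];
            ac += Wxc[o * NI + k] * x[k];
            ao += Wxo[o * NI + k] * x[k];
        }
        for (int k = 0; k < NH; k++) {
            ai += Whi[o * NH + k] * h_prev[k];
            af += Whf[o * NH + k] * h_prev[k];
            ac += Whc[o * NH + k] * h_prev[k];
            ao += Who[o * NH + k] * h_prev[k];
        }
        float i_t  = sigmoidf(ai);              /* (4) input gate       */
        float f_t  = sigmoidf(af);              /* (5) forget gate      */
        float cc_t = tanhf(ac);                 /* (6) candidate state  */
        float o_t  = sigmoidf(ao);              /* (8) output gate      */
        c[o]     = f_t * c[o] + i_t * cc_t;     /* (7) new cell state   */
        h_out[o] = o_t * tanhf(c[o]);           /* (9) new hidden state */
    }
}

On the ASIP, the matrix–vector products of (4)–(6) and (8) map onto the dot-product instruction, and the activations map onto the pl.tanh and pl.sig instructions summarized in the next section.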
C. Enhanced RISC-V ISA

In this section, we summarize our custom ISA extensions used in our ASIP cluster. For more details, we refer the interested reader to [22].

1) xPULP Extensions: The baseline RI5CY core already supports various specialized HW instructions, such as SIMD, HW loops, and postincrement loads, under the name xPULP, which we leverage in the first optimization step.

2) Tanh and Sigmoid Extension (HW): The two nonlinear activation functions used for neural networks, such as LSTM networks, are the sigmoid sig and the hyperbolic tangent tanh. Their execution is often emulated in software with the help of a linear approximation technique requiring multiple iterations until the required precision is reached. This emulation can quickly become a major contributor to the overall execution time of LSTM networks as, for example, Challita et al. [11]


Fig. 4. RNN RISC-V core with extensions to the RI5CY core [23] in blue and the datapath for the pl.sdotsp.h instruction marked in bold lines.

and Naparstek and Cohen [12] show that the calculations of tanh and sig together require 10.5% and 33.6% of the overall computation cycles.

Therefore, we introduce two new single-cycle HW instructions, pl.tanh rD, rA and pl.sig rD, rA. They are implemented as a linear function approximation within certain intervals. The precomputed approximation parameters m and q are loaded from lookup tables (LUTs) for a certain interval. After the linear approximation, the result is mirrored if needed (symmetry property). For more details on the implementation and an error evaluation, we refer the interested reader to [22].
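For illustration, a self-contained software model of such a symmetric piecewise-linear approximation in Q3.12 is sketched below. The interval count and width are our own placeholder choices, and the LUT is built here from chord interpolation of tanh; the actual pl.tanh/pl.sig LUT parameters and their error analysis are those reported in [22].

#include <math.h>
#include <stdint.h>

/* Software model of a symmetric piecewise-linear approximation (PLA) of tanh:
 * split |x| into equal intervals, apply a per-interval slope m and offset q
 * from a LUT, and mirror the result for negative inputs (tanh(-x) = -tanh(x)).
 * Values are in Q3.12 fixed point. Interval count/width are placeholders. */
#define Q_FRAC      12          /* Q3.12: 12 fractional bits                  */
#define N_INTERVALS 16
#define STEP_SHIFT  10          /* interval width = 2^10 / 2^12 = 0.25        */

static int16_t lut_m[N_INTERVALS];   /* slopes  */
static int16_t lut_q[N_INTERVALS];   /* offsets */

void pla_tanh_init(void)
{
    for (int i = 0; i < N_INTERVALS; i++) {
        float x0 = (float)(i << STEP_SHIFT)       / (1 << Q_FRAC);
        float x1 = (float)((i + 1) << STEP_SHIFT) / (1 << Q_FRAC);
        float m  = (tanhf(x1) - tanhf(x0)) / (x1 - x0);     /* chord slope  */
        float q  = tanhf(x0) - m * x0;                      /* chord offset */
        lut_m[i] = (int16_t)lrintf(m * (1 << Q_FRAC));
        lut_q[i] = (int16_t)lrintf(q * (1 << Q_FRAC));
    }
}

int16_t pla_tanh_q312(int16_t x)
{
    int32_t ax  = (x < 0) ? -(int32_t)x : (int32_t)x;       /* use symmetry      */
    int32_t idx = ax >> STEP_SHIFT;                         /* interval select   */
    if (idx >= N_INTERVALS) {                               /* saturation region */
        int32_t one = 1 << Q_FRAC;                          /* tanh(|x|>4) ~ 1   */
        return (int16_t)((x < 0) ? -one : one);
    }
    int32_t y = (((int32_t)lut_m[idx] * ax) >> Q_FRAC) + lut_q[idx];
    return (int16_t)((x < 0) ? -y : y);                     /* mirror for x < 0  */
}

The sigmoid can be handled analogously, e.g., via its own LUT or through the symmetry sig(−x) = 1 − sig(x).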
3) Load and Compute VLIW Instruction (HW): An evaluation of instruction counts in the benchmarks [see Table III(c)] shows that the two xPULP instructions lw! and pl.sdotsp.h are by far the two most executed instructions within our benchmark suite. Thus, we introduce a new instruction that combines the two within a single pl.sdotsp.h instruction, which is capable of loading data and calculating the 16-bit packed SIMD sum-dot-product

    rD[31:0] += rA[31:16]*rB[31:16] + rA[15:0]*rB[15:0].

rA contains the memory address, loaded from memory by the load/store unit (LSU), and is incremented for the next data access. The instruction makes use of two special-purpose registers, SPR, which are written and read in an alternating way (using the pl.sdotsp.h.0 and pl.sdotsp.h.1 instructions), which helps to avoid a two-cycle latency and, thus, unnecessary stalling. As the presented customized instructions were implemented for the RI5CY core, Fig. 4 shows the extended RI5CY datapath, where the datapath for the extended pl.sdotsp.h instruction is highlighted in blue.

A comparison between the assembly code with (middle) and without (left) the output FM tiling of size d = 4 is shown in Fig. 5. In the middle, the first two pl.sdotsp.h instructions before the HW loop preload the two SPR with the first two weights. The corresponding IFM is loaded in line 4, followed by a bubble caused by the latency of the load word instruction and the following instructions' data dependency.

Fig. 5. Assembly code comparison: baseline + output FM tiling, +pl.sdotsp.h instruction, and +input FM tiling.

D. Loop Tiling and Double Buffering

1) Baseline: As a baseline, we have developed a straightforward implementation for the FCL and LSTM kernels. For example, the matrix–vector multiplication makes use of a double nested loop over all inputs and outputs. All weights and activations are encoded into the 16-bit Q3.12 fixed-point format, and all computations were validated against a 32-bit floating-point implementation. The 16-bit quantization can be applied on the benchmark suite without fixed-point-aware retraining and, therefore, offers a good compromise between accuracy/robustness and energy efficiency/throughput.

2) Output Feature Map Tiling (SW): A single MAC operation requires two inputs, which need two memory loads: one for the input feature and one for the weight. While the weights differ for each input feature, the input features can be reused for several outputs. The next improvement step exploits this fact by reorganizing the output features in tiles of d output features over which a loaded input feature is reused. The partial sums of the d output features are kept in d processor registers. They are written back to the memory once all input features have contributed their part to the corresponding results. Algorithm 1 gives a high-level overview of the aforementioned output FM tiling.

Algorithm 1 FCL With Output FM Tiling

TABLE II
TCDM CONTENTION IN % WITH ALL SINGLE-CORE OPTIMIZATIONS ENABLED, WITH A WEIGHT OFFSET OF 6 AGAINST NO WEIGHT OFFSET, ON AN FCL LAYER WITH N_I = N_O = 64, EVALUATED ON A 16-CORE CLUSTER CONFIGURATION WITH 32× L1 MEMORY BANKS
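The Algorithm 1 listing itself is not reproduced here; the following C sketch approximates its structure under our own naming, for a fixed tile of d = 4 accumulators held in registers. The helper sdotp2() only models the arithmetic of the packed sum-dot-product; the actual pl.sdotsp.h instruction additionally performs the weight load and address post-increment in the same cycle.

#include <stdint.h>

/* Models the arithmetic of the packed 16-bit sum-dot-product used by the
 * kernel: acc += a[1]*b[1] + a[0]*b[0] (two Q3.12 products, 32-bit accumulator). */
static inline int32_t sdotp2(int32_t acc, const int16_t *a, const int16_t *b)
{
    return acc + (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
}

/* FCL with output FM tiling, sketched for a tile size of d = 4.
 * All data are Q3.12; products are accumulated in 32-bit and shifted back by
 * 12 bits before being stored. For brevity, N_I is assumed to be a multiple of
 * 2 and N_O a multiple of 4; the real kernel covers remainders with tiles of
 * d in {8, 4, 2, 1} and packs two results per store (v2s). */
void fcl_ofm_tiled(int N_I, int N_O,
                   const int16_t *in, const int16_t *w,   /* w is N_O x N_I, row-major */
                   const int16_t *bias, int16_t *out)
{
    for (int o = 0; o < N_O; o += 4) {
        /* Preload the biases of the d output features into register accumulators. */
        int32_t a0 = (int32_t)bias[o + 0] << 12;
        int32_t a1 = (int32_t)bias[o + 1] << 12;
        int32_t a2 = (int32_t)bias[o + 2] << 12;
        int32_t a3 = (int32_t)bias[o + 3] << 12;
        const int16_t *w0 = &w[(o + 0) * N_I];
        const int16_t *w1 = &w[(o + 1) * N_I];
        const int16_t *w2 = &w[(o + 2) * N_I];
        const int16_t *w3 = &w[(o + 3) * N_I];
        /* Each loaded pair of input features is reused by all d accumulators:
         * one input load feeds d MAC instructions (pl.sdotsp.h on the ASIP). */
        for (int k = 0; k < N_I; k += 2) {
            a0 = sdotp2(a0, &in[k], &w0[k]);
            a1 = sdotp2(a1, &in[k], &w1[k]);
            a2 = sdotp2(a2, &in[k], &w2[k]);
            a3 = sdotp2(a3, &in[k], &w3[k]);
        }
        /* Shift the widened partial sums back to Q3.12 and store the tile. */
        out[o + 0] = (int16_t)(a0 >> 12);
        out[o + 1] = (int16_t)(a1 >> 12);
        out[o + 2] = (int16_t)(a2 >> 12);
        out[o + 3] = (int16_t)(a3 >> 12);
    }
}

Each loaded pair of input features is consumed by all d accumulators, which is the data reuse that the following paragraph quantifies.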

The loaded input feature (line 7) can be used by d pl.sdotsp instructions (line 11), each performing two MAC operations on 16-bit data. In total, this results in O(1 + 1/N) loads needed per pl.sdotsp instruction. The FM tile size d is optimally smaller than or equal to the number of available processor registers, as, in these cases, the partial results can be kept locally. Going beyond this limitation would decrease efficiency because the intermediate results have to be pushed back into the memory before being reused.

In addition, the compiler can rearrange the instructions and hide the load latency. We empirically determined an optimal tile size of d_optimal = 8 for our implementation, which means that each core works on multiple rounds of tiles of size d ∈ {8, 4, 2, 1} until no remainder is left. The input features of convolutional layers can be rearranged and replicated (i.e., im2col) in such a way that their computation is mapped onto a matrix–matrix multiplication (as shown by Lai et al. [38] and Garofalo et al. [39]), enabling the same tiling optimization.

3) Input Feature Map Tiling (SW): To get rid of the bubble, as shown in Fig. 5, another optimization was implemented where the loop loads two input data words, which corresponds to four 16-bit input features. This doubles the number of pl.sdotsp.h instructions in the innermost loop, as shown on the right-hand side of Fig. 5.

4) Multicore Tiling: For the evaluation of the cluster-based configuration on up to 16× enhanced RI5CY cores, the kernels and computations are tiled and distributed along the output dimension, which results in every core working on Ñ_O output features in parallel, as follows:

    Ñ_O = ⌈N_O / N_cores⌉.    (10)

In contrast to tiling along the input dimension, this method does not require any additional overhead for combining partial results across multiple cores. Fig. 6 shows conceptually how the computation on the output FM tiles of a linear layer is distributed on an exemplary cluster with four RNN ASIPs. Our tiling approach is commonly used [38]; however, we adapt the tiling to overlap as many DMA transfers with computation phases as possible for optimal performance. As the first step, every core has to compute its assigned tile size and the start addresses of the assigned weights and features, which costs us up to 14 instructions.

Fig. 6. Example of the tiling for a linear layer which is distributed on a four-core ASIP cluster.

5) Storing Weights With Address Offset: When multiple cores access the cluster memory, various cores may want to access the same memory bank at the same time to load the weights or features of the model. When this happens, only one of the cores gets its request granted and receives the requested data, while the other cores' memory load requests get stalled and have to wait. These so-called banking conflicts can result in many wasted cycles and should be prevented. As a countermeasure, we implemented an offset in the weight storage in memory, allowing every core to start at a different address. Table II shows the cycles lost due to TCDM banking conflicts with a weight offset of six against no weight offset.

6) Double Buffering: The double-buffering concept uses two buffers to store input data. While the data in the first buffer are being processed (e.g., the weights of the current network layer), new data (e.g., the weights of the next network layer) are loaded into the second buffer. This alternating buffer usage allows for overlapping the computation and DMA transaction phases and reduces the overall run time. However, entirely hiding the DMA transaction is only possible if the computation time of a network tile is longer than the DMA data transfer time for loading the next tile's parameters.

7) Batching: If the DMA transaction outlasts the computation time, the cores need to wait and are stalled, thereby reducing the speedup. We increase the computation time by batching the input activations, which means that the parameters of a layer or model are loaded once and are consumed by computing not only on one input feature vector but on one or multiple following input features. However, introducing this form of batching increases the latency, which, consequently, might violate the tight timing constraints for RRM decisions. Hence, batching can be applied to a limited extent.
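A minimal sketch of how each core could derive its share of the output dimension according to (10), combined with the staggered weight start offset of Section III-D5, is shown below; the function and field names are our own, the core-ID query stands in for the cluster's hardware register, and the exact offset scheme used in the released implementation is not detailed here.

/* Static work partitioning along the output dimension, following (10):
 * each of n_cores cores covers ceil(N_O / n_cores) output features. */
typedef struct {
    int o_start;   /* first output feature handled by this core              */
    int o_count;   /* number of output features in this core's tile          */
    int w_offset;  /* staggered start offset into the weight row (elements)  */
} core_tile_t;

core_tile_t assign_output_tile(int N_O, int n_cores, int core_id)
{
    core_tile_t t;
    int per_core = (N_O + n_cores - 1) / n_cores;         /* ceil(N_O / n_cores) */
    t.o_start = core_id * per_core;
    t.o_count = (t.o_start + per_core <= N_O) ? per_core
              : (t.o_start < N_O ? N_O - t.o_start : 0);  /* last core may get fewer */
    /* Stagger the first weight element each core touches so that the cores do
     * not all hit the same TCDM bank in the same cycle; Section III-D5 evaluates
     * an offset of six elements. The exact storage scheme is an assumption here. */
    t.w_offset = 6 * core_id;
    return t;
}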

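The double-buffering scheme of Section III-D6 can be sketched as follows. The DMA interface (dma_copy_in_async(), dma_wait()) and the descriptors are invented placeholder names standing in for the cluster's DMA driver, not the actual PULP API; the batching of Section III-D7 would simply run the compute call over several input vectors per loaded tile.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical DMA driver interface (placeholder names). */
typedef int dma_job_t;
dma_job_t dma_copy_in_async(void *l1_dst, const void *l2_src, size_t bytes);
void      dma_wait(dma_job_t job);

/* One network tile's parameters resident in L2. */
typedef struct { const void *l2_weights; size_t bytes; } tile_desc_t;

void compute_layer_tile(const void *l1_weights, int16_t *act_in, int16_t *act_out);

/* Double buffering: while tile i is computed from buffer cur, the DMA already
 * fills buffer nxt with tile i+1. The transfer is fully hidden only if the
 * compute time of a tile exceeds its transfer time (Section III-D6). */
void run_tiles(const tile_desc_t *tiles, int n_tiles,
               void *l1_buf[2], int16_t *act_in, int16_t *act_out)
{
    int cur = 0;
    dma_job_t job = dma_copy_in_async(l1_buf[cur], tiles[0].l2_weights, tiles[0].bytes);
    for (int i = 0; i < n_tiles; i++) {
        int nxt = cur ^ 1;
        dma_wait(job);                               /* tile i is now in L1   */
        if (i + 1 < n_tiles)                         /* prefetch tile i+1     */
            job = dma_copy_in_async(l1_buf[nxt], tiles[i + 1].l2_weights,
                                    tiles[i + 1].bytes);
        compute_layer_tile(l1_buf[cur], act_in, act_out);  /* overlaps with DMA */
        cur = nxt;
    }
}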

IV. MULTICORE SPEEDUP MODEL

To verify the optimality of the parallelized implementation, we introduce an upper-bound model for the speedup based on Amdahl's Law and the innermost loop behavior. Amdahl's Law, as shown in (11), describes the theoretical speedup SU, which can be achieved by running a model on multiple cores in parallel. P is the fraction of the code that can be parallelized with a speedup S

    SU = 1 / ((1 − P) + P/S).    (11)

Out of these parameters, P is static and can be measured, whereas S depends on the computational load. For example, if the load is perfectly dividable by the number of available cores, the speedup achieves its maximum S_ideal = N_cores. However, any load imbalance can cause a nonideal speedup S_estimate that we estimate with our model. The following evaluation targets the FCL kernel. For simplicity reasons, we assume that N_I and N_O are even numbers.

As the multicore implementation follows the second level of output FM tiling, each core works on Ñ_O output features, as described in (10). Following the first level of output FM tiling performed on the Ñ_O of a single core, each core works on a round of tiles of size d ∈ {8, 4, 2, 1} until no remainder is left. Equation (12) defines I_d, the number of instructions in the parallelizable part P of the kernel for a single tile of size d. The total amount of instructions can then be computed by the following equations:

    I_d = 2d + ⌈N_I/4⌉ · (2d + 2) + 2d    (12)
    I_tot,N_cores = Σ_{d ∈ {8,4,2,1}} N_tiles,d · I_d    (13)
    N_tiles,d = ⌊N_O,d / d⌋    (14)
    N_O,d = Ñ_O, if d = d_max = 8;  Ñ_O % (2d), if d ∈ {4, 2, 1}.    (15)

In (12), the first 2d instructions preload the bias and compute the addresses, the ⌈N_I/4⌉ · (2d + 2) term covers the innermost loop (two data loads plus 2d sum-dot-products per four input features), and the final 2d instructions shift the results and store them in v2s form. With the help of these equations, the number of instructions of the parallelizable part P can be estimated for the number of used cores N_cores ∈ {1, 2, 4, 8, 16}, which allows estimating a more accurate achievable speedup S_estimate,N_cores by the following equation:

    S_estimate,N_cores = I_tot,1 / I_tot,N_cores.    (16)

We analyzed S_estimate for two configurations.

1) N_I ∈ {4, 5, . . . , 512} and N_O = 128.
2) N_O ∈ {4, 5, . . . , 512} and N_I = 128.

As the speedup is independent of the number of input features N_I, we only show the dependency of S_estimate on the number of output features N_O of the FCL kernel in Fig. 7. We define the following properties.

Fig. 7. S_estimate for various N_O and various numbers of active cores.

1) N_I-Independency: S_estimate is independent of N_I.
2) N_O-Dependency: S_estimate depends on N_O.
   a) Saturation: A minimum amount of output features N_O is needed to fill up the various cores.
   b) Optimality Condition: Once a core is saturated, the optimal S_estimate = S_ideal = N_cores can be achieved when (N_O/N_cores) % d_optimal = Ñ_O % d_optimal = 0 is fulfilled, where d_optimal = 8.

Following these properties, for a single core, a multiple of eight output features is optimal. Other configurations behave similarly: two cores with a multiple of 16, four with 32, eight with 64, and 16 cores with 128.
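For concreteness, the instruction-count model (12)–(16) can be evaluated with the short routine below; it is our own illustrative transcription of the reconstructed equations with d_optimal = 8, not code from the released implementation.

/* Illustrative transcription of the instruction-count model (12)-(16) for the
 * parallelizable part P of the FCL kernel, with d_optimal = 8. */
static long I_d(long N_I, long d)
{
    /* (12): 2d (preload bias & get address) + ceil(N_I/4)*(2d + 2)
     * (innermost loop: 2 data loads + 2d sum-dot-products per four inputs)
     * + 2d (shift & store in v2s form) */
    return 2 * d + ((N_I + 3) / 4) * (2 * d + 2) + 2 * d;
}

static long I_tot(long N_I, long N_O, long n_cores)
{
    long NO_tilde = (N_O + n_cores - 1) / n_cores;               /* (10) */
    long total = 0;
    for (long d = 8; d >= 1; d /= 2) {                           /* tiles of 8,4,2,1 */
        long N_O_d   = (d == 8) ? NO_tilde : NO_tilde % (2 * d); /* (15) */
        long n_tiles = N_O_d / d;                                /* (14) */
        total += n_tiles * I_d(N_I, d);                          /* (13) */
    }
    return total;
}

double S_estimate(long N_I, long N_O, long n_cores)              /* (16) */
{
    return (double)I_tot(N_I, N_O, 1) / (double)I_tot(N_I, N_O, n_cores);
}

For example, with N_O = 128 the routine returns S_estimate = 16 on 16 cores for any N_I, matching the optimality condition stated above.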


TABLE III
CYCLE AND INSTRUCTION COUNT FOR THE ENTIRE RRM BENCHMARK SUITE (RI5CY IN BOLD AND NEW EXTENSIONS IN BLUE). (a) W/O OPT. (RV32IMC). (b) +SIMD/HWL (XPULP). (c) +OUT-FM TILE./TANH/SIG. (d) +PL.SDOTSP INSTRUCTION. (e) +INPUT FM TILING

Fig. 8. Speedup with respect to the RISC-V IMC baseline implementation for a typical neural networks workload in RRM.

V. EVALUATIONS AND RESULTS

First, we discuss the results for the single-core acceleration system. We introduce three different cluster configurations for the evaluation of the multicore acceleration system. Finally, we present the HW impact of the ISA extensions. All performance numbers were obtained from an event-based C++ virtual platform implementation called GVSoC, whose cycle accuracy was calibrated with RTL simulation measurements [45].

A. Single-Core Evaluation

The single-core configuration has an L1 memory size of 1 MB and only one core, meaning that the modules with a dashed border in the RRM cluster on the right-hand side of Fig. 3 are nonexistent (N = 1 in Fig. 3). The performance results for this cluster configuration are presented in the same order in which we applied the optimizations. After every step, we focus on the largest cycle and instruction count of the optimized implementation to optimize the highest contributors in the next optimization step (see Table III). The straightforward C implementation is compiled with standard GCC 7.1.1 for the RISC-V RV32IMFC ISA and run on the single-core cluster configuration with the unoptimized baseline RI5CY core and an L1 memory size of 1 MB. With this rather large L1 memory size, all models can be stored locally, allowing an evaluation of a single ASIP independent of any model tiling effects. The instruction count for the entire benchmark suite is shown in Table III(a) and serves as a baseline for all HW and SW optimizations implemented on the single-core cluster configuration. Table III(b) shows that the three optimization techniques, SIMD, HWL, and postincrement load, as described in Section III-C1, achieve a 4.0× reduction in the number of instructions with respect to the unmodified RISC-V IMC baseline.

The optimal OFM tiling into tiles of size d_optimal = 8, as described in Section III-D2, brings an additional improvement of 1.89× on the RRM benchmark suite, as shown in Table III(c). A more detailed insight into the various benchmarks is given in Fig. 8. While most networks improved between 1.79× [19] and 1.87× [15], those with smaller OFM sizes N_O suffer from the higher overhead and achieve a smaller speedup, e.g., 1.07× [13] and 1.30× [12]. The first HW extensions for tanh and sig, as described in Section III-C2, are only used by the LSTM-based benchmarks [11], [12]. They allow a further cycle count reduction from 51.2 to 44.5 kcycles, which results in a 13.0% improvement. The Load and Compute HW instruction described in Section III-C3 can again be exploited by all benchmarks and reduces the overall cycle count by another 1.7×, as can be seen in Table III(d). The additional IFM tiling, as described in Section III-D3, gives an additional modest gain of 1.05× (or 4.9%) [see Table III(e)], since loads and stores from the stack increase by 1.4× as more registers are needed.

In summary, the achieved overall speedup of 10.6× with respect to the RISC-V IMC baseline comes from using SIMD and HWL from the xPULP extension (4.0×), the OFM tiling (1.7×), the activation function instruction (3.0%), the merged load and compute instruction (1.5×), and the IFM tiling (3.0%). In total, we achieve an additional speedup of 3.4× compared to the xPULP implementation. The relative benefits of each optimization step for the full benchmark are shown in Fig. 8. While the input FM tiling had a positive effect for most networks, benchmarks with smaller FMs need slightly more cycles caused by the increased stack operations.

TABLE IV
MEASURED OPERATIONS/CYCLE ON AN OPTIMIZED SINGLE-CORE IMPLEMENTATION AGAINST A PARALLELIZABLE SINGLE-CORE IMPLEMENTATION, SHOWING THE PARALLELIZATION OVERHEAD, SUCH AS SYNCHRONIZATION AND TILE DETERMINATION

B. Multicore Evaluation

1) Multicore Cluster Configurations: We evaluate the individual kernels and the benchmark suite on three multicore cluster configurations.

1) Evaluation Cluster: 16× ASIPs, 1-MB L1, and min. 1-MB L2.
2) Large Cluster: 16× ASIPs, 512-kB L1, and min. 1-MB L2.
3) Small Cluster: 8× ASIPs, 64-kB L1, and min. 512-kB L2.

The first, evaluation cluster configuration is only used to analyze the speedups for the LSTM and FCL kernels. Note that the large L1 memory, if brought on silicon, would impose a large cost in terms of area in a 5G SoC. The second, large cluster configuration allows keeping most of the benchmark suite locally in the L1 memory. For optimal allocation of the various resources for wireless communication, often multiple models are run on the same platform. Therefore, being capable of loading a complete model at a time comes in handy. Anyway, an evaluation on this configuration will give an upper limit for the achievable speedup for smaller cluster configurations where some form of tiling is either needed to adjust the load of imbalanced layer sizes or simply because the complete model does not fit into L1. The memory size of the small cluster configuration is the most affordable in terms of silicon real estate in a 5G SoC and, furthermore, has in a similar form already been taped out successfully [24], [46], making it a reasonable choice for a final evaluation of the RRM benchmark suite. As the parallelization of CNNs has already been analyzed thoroughly in related work [39], we will ignore the CNN-based model I [15] for the following experimental results.

2) Multicore, Standalone Kernel, Evaluation Cluster: The following results of our parallelized LSTM and FCL kernels are measured on the evaluation cluster configuration. The complete evaluation for the FCL kernel, as shown in Fig. 9(a)–(d), covers a sweep over the input and output FM sizes N_I, N_O ∈ {32, 64, . . . , 384}.


Fig. 9. Speedup on the evaluation cluster of a single FCL or LSTM layer for various IFM and OFM sizes. (a) FCL: speedup for two cores. (b) FCL:
speedup for four cores. (c) FCL: speedup for eight cores. (d) FCL: speedup for 16 cores. (e) LSTM: speedup for two cores. (f) LSTM: speedup for four
cores. (g) LSTM: speedup for eight cores. (h) LSTM: speedup for 16 cores.
TABLE V
ACHIEVED OP/CYCLE FOR THE LARGER MODELS ON THE LARGE CLUSTER CONFIGURATION AND THE SMALL CLUSTER CONFIGURATION WITH AND WITHOUT BATCHING. THE RELATIVE DIFFERENCE IS COMPARED TO THE ACHIEVED OPERATIONS/CYCLE ON THE LARGE CLUSTER CONFIGURATION

When only two [see Fig. 9(a)] or four cores [see Fig. 9(b)] in the cluster are enabled (while the others are shut off), the achieved speedup on the FCL kernel saturates toward its upper limit of SU_ideal = N_cores already for rather small N_I and/or N_O. In contrast, when eight or 16 cores are active, a higher output FM size N_O than input FM size N_I is needed for the achieved speedup to incline toward the ideal speedup. These observations are aligned with the Saturation and the N_I-independency properties of our simplified model in Section IV. The latter property shows itself in the smooth increase in SU_measured in the N_I-direction. The N_O-dependency of the stated Saturation is visible in all plots; however, the Optimality property of the N_O-dependency can be observed better for eight or 16 active cores [see Fig. 9(c) and (d)]. Over the complete sweep of the FCL kernel, the highest speedups are 2×, 4×, 7.7×, and 13.8× for, respectively, two, four, eight, and 16 cores.

The same properties can again be observed in Fig. 9(e)–(h) for the sweep of the LSTM kernel over the IFM and OFM sizes N_I, N_O ∈ {2, 34, . . . , 162}. The highest used FM size for the LSTM sweep is smaller than that used for the FCL sweep since an LSTM kernel is larger than an FCL kernel and, therefore, saturates the L1 memory of the evaluation cluster configuration faster. Over the complete sweep of the LSTM


Fig. 10. Speedup when using multiple cores for typical neural networks workload in RRM when the whole model fits into L1. The single-core speedup
corresponds to the achieved speedup with all single-core implementations.

kernel, the highest achieved speedups are 1.9×, 3.5×, 6.1×, and 10.2× for, respectively, two, four, eight, and 16 cores.

3) Multicore, Benchmark Suite, Large Cluster: When using the large cluster configuration instead of the single-core configuration, the implementation was extended with the necessary synchronization and tile offset computation (see Section III-D4). Table IV compares the achieved MAC/cycle for the implementations when only one core in the large cluster configuration is active, including synchronization and tile offset computation, against the single-core configuration. On average, we lose 0.08 operations/cycle due to the parallelization overhead, which corresponds to approximately 2.6%. Taking this overhead into account, Fig. 10 shows the achieved speedups in relation to the speedups achieved on the single-core implementation, including the synchronization and tile offset computation.

A first observation shows that model B [12] and model H [13] achieve no or even a worse speedup when using more cores. As listed in Table I, all layers of these models have an output FM size of 32 or, mostly, even less, resulting in highly starved cores achieving a maximum speedup of 1.4× against the "parallelized" single-core version. Again, this starvation corresponds to the Saturation property, as stated in Section IV. For these model sizes, only two cores in the large cluster configuration should be enabled, as using more cores brings very limited performance gain while consuming more power. The medium-sized models A [11], F [14], and G [19] with 24k–37k parameters achieve slightly higher speedups of 3.9×, 4.8×, and 4.7×, respectively, when making use of all 16 available cores. The small speedup gain when using 16 PEs instead of eight PEs clearly shows that using 16 PEs is again a waste of energy and area for these medium-sized models. Models with bigger layers, such as models C [16], D [20], and E [10], which each have multiple layers with ≥100 OFM sizes, gain much more by using more cores and achieve speedups of 11.7×, 6.9×, and 7.8× when using all 16 cores and speedups of 6.2×, 5.1×, and 5.5× when using eight cores against the parallelized implementation on a single core. Directly comparing them against our baseline single RISC-V IMC core implementation, this corresponds to total speedups of 193.7×, 110.4×, and 132.0× on 16 cores and 101.9×, 81.5×, and 93.4× on eight cores.

Fig. 11. Layerwise breakdown of the computation time of each layer, including the DMA transaction time, which is overlapped with the computation, for model E [10].

4) Multicore, Benchmark Suite, Small Cluster: With the smaller cluster configuration, including eight cores and only 64-kB L1 memory, some form of model tiling is necessary. The first straightforward approach of tiling on the layer level means that the model is loaded and computed in a layer-by-layer fashion with the double-buffering approach described in Section III-D6. However, models C [16], D [20], and E [10] have layers that do not entirely fit into the L1 memory and, thus, require further tiling. As most models are highly unbalanced, any form of tiling improves the load balance. We focus on the OFM tiling because IFM tiling causes additional overhead for combining results across the cores.

In an exemplary case study, we show a detailed analysis of the applied tiling and its consequences on model E [10], covering both cases of layer imbalance and too big layer sizes. The first layerwise computation cycle breakdown with staggered DMA transaction cycles (see Section III-D6) is shown in Fig. 11. As the individual layers would be too big for the small cluster configuration, these specific measurements were taken from the evaluation cluster configuration. A first look at layer 1 shows clearly that the DMA transaction for the second layer (which progresses in parallel to the layer 1 computation) takes much longer than the computation time for the layer. As layers 2 and 3 are both too big to fit into L1, their necessary tiling coincidentally improves this load imbalance. Fig. 12(a) shows that, even with tiled layers, the aforementioned DMA transaction for the second layer cannot be hidden, and we are bandwidth-limited. When using more than two cores, the situation gets even worse, and we are bandwidth-limited over almost all layers, making the usage of four or more cores questionable. We have applied similar tiling to all reasonably big models and have listed the

Fig. 12. Layerwise breakdown of the computation time of each layer, including the DMA transaction time, for model E [10]. (a) Model E with tiling. (b) Model E with tiling and batching.

achieved operations/cycles in Table V. We count 1× MAC


as 2× operations. Overall, we achieve up to 6.94, 5.42, and
2.98 Op/cycle for four, two, and one cores, respectively. With
the peak DMA bandwidth of 8 B/cycle, we lose on average
8.9%, 17.2%, and 38.2% to the achieved optimal Op/Cycle on
one, two, and four cores due to the heavy load imbalance of
the RRM benchmarks. We conclude that using four or more
cores brings very limited gain unless the data bandwidth can
be extended to >8 B/cycle. Fig. 13. Area distribution of the extended RI5CY core.
5) Multicore, Benchmark Suite With Batching, Small Cluster: One possibility to improve the situation is batching, i.e., computing multiple IFMs at the same time, which allows us to reuse the loaded parameters. As this doubles the computation time while keeping the DMA transaction time constant, we can push most layers of our example model E from bandwidth-limited into computation-limited operation. Fig. 12(b) shows the tiled model E, where each layer works simultaneously on two batched IFMs. Batching solves the bandwidth limitations completely when using only a single core and almost entirely when using two or four cores. In addition, the fully bandwidth-limited situation shifts from four cores to eight cores. Only for the single-core implementation do we reach the optimal case, as in the large cluster configuration. Nevertheless, batching improves the situation significantly, as shown in Table V: we achieve up to 10.17, 6.03, and 3.17 Op/cycle on four, two, and one core, respectively. Batching, therefore, limits the average loss to only 15.6% of the average optimal 10.3 Op/cycle, making the usage of four cores in combination with a data bandwidth of 8 B/cycle reasonable. It should be noted that this form of batching increases the latency of the model, which might violate the tight timing constraints for RRM decisions at the physical layer.
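A minimal sketch of such a batched fully connected kernel is shown below (BATCH fixed to 2, fixed-point rescaling illustrative; this is a sketch of the idea, not our actual kernel). Every weight fetched from memory is reused for all IFMs of the batch, which doubles the number of MACs per transferred weight byte and is what moves most layers into the compute-limited regime.

```c
#include <stdint.h>

#define BATCH 2   /* number of IFMs processed per weight load */

/* Illustrative batched fully connected layer: every weight row is loaded
 * once and reused for BATCH input feature maps. */
void fc_layer_batched(const int16_t *ifm,      /* [BATCH][ifm_size] */
                      const int16_t *weights,  /* [ofm_size][ifm_size] */
                      const int32_t *bias,
                      int16_t *ofm,            /* [BATCH][ofm_size] */
                      int ifm_size, int ofm_size)
{
    for (int o = 0; o < ofm_size; o++) {
        const int16_t *w_row = &weights[o * ifm_size];
        int32_t acc[BATCH];
        for (int b = 0; b < BATCH; b++) acc[b] = bias[o];

        for (int i = 0; i < ifm_size; i++) {
            int16_t w = w_row[i];               /* loaded once ...          */
            for (int b = 0; b < BATCH; b++)     /* ... reused BATCH times   */
                acc[b] += (int32_t)w * (int32_t)ifm[b * ifm_size + i];
        }
        for (int b = 0; b < BATCH; b++)
            ofm[b * ofm_size + o] = (int16_t)(acc[b] >> 8);  /* illustrative */
    }
}
```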
C. Hardware Implementation

To analyze the implementation of the extended RI5CY core, an eight-track low-threshold-voltage (LVT) standard cell library of the GlobalFoundries 22-nm FDX technology was used. We used Synopsys Design Compiler 18.06 for synthesis and Cadence Innovus 18.11 for the back-end flow. The gate-level simulations with back-annotated delays for the power estimates were run with Modelsim Questa v2019.1 on the final layout.

Fig. 13. Area distribution of the extended RI5CY core.

The implemented extensions have no influence on the critical path, which lies between the load-store unit and the memory in the write-back stage. The extended RI5CY achieves 380 MHz at 0.65 V under typical conditions at room temperature. The extensions introduce a small area overhead of 2.3 kGE, which corresponds to 3.4% of the total core area. The area breakdown, shown in Fig. 13, results from the final placed-and-routed layout. From the performance perspective, a single extended core performs the relevant benchmarks on average 10.6× faster than the standard RISC-V core with RV32IMC instructions and achieves 566 MMAC/s instead of 21 MMAC/s. When the core uses the extensions, the power consumption rises from 1.73 to 2.61 mW (a 51% total increase). While the decoder contributes only little additional power (approximately 5 μW), the higher power consumption is mainly due to the higher utilization of the compute units (ALU and MAC unit, 0.57 mW or 33% of the total power), the increased GPR usage (0.16 mW or 9%), and the higher use of the load-store unit (0.05 mW or 3%). However, the overall energy efficiency of 218 GMAC/s/W represents a 10× improvement. The power consumed by the L1 and L2 memories and the rest of the system is not taken into account; including these energy costs would most likely result in energy-efficiency gains closer to the observed performance gains. The given 10× energy-efficiency improvement can, therefore, be taken as a safe lower bound on the overall energy-efficiency improvement of the RRM acceleration system.
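As a rough consistency check on the numbers above (derived arithmetic, not an additional measurement), the reported throughput, clock frequency, power, and efficiency figures line up:

\[
\frac{566\ \mathrm{MMAC/s}}{380\ \mathrm{MHz}} \approx 1.49\ \mathrm{MAC/cycle} \approx 2.98\ \mathrm{Op/cycle},
\qquad
\frac{566\ \mathrm{MMAC/s}}{2.61\ \mathrm{mW}} \approx 217\ \mathrm{GMAC/s/W}.
\]

The first ratio is consistent with the up to 2.98 Op/cycle listed for a single core in Table V, and the second matches the reported 218 GMAC/s/W up to rounding.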


D. Comparison With Related Work

TABLE VI: COMPARISON WITH RELATED WORK

Table VI compares the performance and energy-efficiency numbers of various related works: a hardwired accelerator [47], a GPU [48], a vector processor [49], our baseline RV32IMC core, and our proposed RRM ASIP. The GPU comes with the highest performance and slightly lower energy efficiency than the hardwired accelerator. However, for our complete benchmark set of 3.2 MOp (an average of 320 kOp per model), its compute throughput of 11 TOp/s is disproportionately large. The vector processor supports floating-point operations, offering more numerical precision than needed by our applications, which results in an increased area cost and lower energy efficiency than our RRM ASIP. The hardwired accelerator provides the highest energy efficiency; however, hardwired accelerators cannot flexibly adapt to the rapidly changing RRM field, as new algorithms would require a costly HW redesign [37]. Even though our implementation focuses on 16-bit instead of 8-bit arithmetic, on a single-ASIP system, we achieve >2× higher Op/cycle than a comparable commercially available dual-issue core (STM32H743). Adapting our HW and SW optimizations to 8 bit could be promising for applications that are less sensitive to arithmetic precision. Overall, our customized ISA instructions are tailored to the needs of the targeted LSTM and MLP RRM models while maintaining flexibility. In addition, the number of compute units is adequately scaled to the average benchmark size.
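For reference, the type of inner loop these extensions target is the 16-bit dot product sketched below in portable C. This is a functional sketch only, not our optimized kernel: on the extended core, the loop body is meant to map onto hardware-loop control, post-incrementing packed loads, and a 2-way SIMD multiply-accumulate, whose exact instruction mnemonics we omit here.

```c
#include <stdint.h>

/* Functional sketch of a 16-bit dot-product inner loop (n assumed even).
 * On the extended ASIP, each iteration is intended to correspond to one
 * packed load per operand stream and one 2-way SIMD MAC; this scalar C
 * version only illustrates the arithmetic, not the actual instructions. */
static inline int32_t dot_product_i16(const int16_t *a, const int16_t *b,
                                      int n, int32_t acc)
{
    for (int i = 0; i < n; i += 2) {
        acc += (int32_t)a[i]     * (int32_t)b[i];
        acc += (int32_t)a[i + 1] * (int32_t)b[i + 1];
    }
    return acc;
}
```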
VI. CONCLUSION

In this work, we have identified a selected set of recently proposed real-world RRM-targeted benchmarks based on MLPs and RNNs. Starting from a baseline, simple RISC-V core, we first introduce instruction extensions coupled with software optimizations for the selected RRM benchmarks and evaluate them on a single-core and a multicore cluster acceleration system. For the single-core acceleration system, we demonstrate an energy efficiency of 218 GMAC/s/W and a throughput of 566 MMAC/s, corresponding to improvements of 10× and 10.6×, respectively, over the single-core system with a baseline RV32IMC core. For the multicore acceleration system, we analyze the dependency of the parallel speedup on the input and output FM size for FCLs and LSTM layers, achieving up to 10.2× speedup with 16 cores over a single extended RI5CY core for single LSTM layers and a speedup of 13.8× for a single FCL. On the full RRM benchmark suite, we achieve average overall speedups of 16.4×, 25.2×, 31.9×, and 38.8× on two, four, eight, and 16 cores, respectively, compared with our single-core RV32IMC baseline implementation.

ACKNOWLEDGMENT

The authors thank Matteo Spallanzani for the valuable discussions.

REFERENCES

[1] N. D. Tripathi, J. H. Reed, and H. F. VanLandingham, Radio Resource Management in Cellular Systems, vol. 618. Springer, 2006. [Online]. Available: https://books.google.ch/books?hl=de&lr=&id=5dc-AAAAQBAJ&oi=fnd&pg=PP13&dq=Radio+Resource+Management+in+Cellular+Systems&ots=6Re5ZlU6Ru&sig=s23Kqqs9Y6MP6ycAOW0z5a1YHXE#v=onepage&q=Radio%20Resource%20Management%20in%20Cellular%20Systems&f=false
[2] S. Manap, K. Dimyati, M. N. Hindia, M. S. A. Talip, and R. Tafazolli, "Survey of radio resource management in 5G heterogeneous networks," IEEE Access, vol. 8, pp. 131202–131223, 2020.
[3] M. Naeem, K. Illanko, A. Karmokar, A. Anpalagan, and M. Jaseemuddin, "Optimal power allocation for green cognitive radio: Fractional programming approach," IET Commun., vol. 7, no. 12, pp. 1279–1286, Aug. 2013.
[4] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, "An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2011, pp. 3060–3063.
[5] K. I. Ahmed, H. Tabassum, and E. Hossain, "Deep learning for radio resource allocation in multi-cell networks," IEEE Netw., vol. 33, no. 6, pp. 188–195, Nov. 2019.
[6] A. Hannun et al., "Deep speech: Scaling up end-to-end speech recognition," 2014, arXiv:1412.5567. [Online]. Available: https://arxiv.org/abs/1412.5567
[7] H. Ze, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., May 2013, pp. 7962–7966.
[8] Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," pp. 1–23, 2016, arXiv:1609.08144. [Online]. Available: http://arxiv.org/abs/1609.08144
[9] M. Zanghieri, S. Benatti, A. Burrello, V. Kartsch, F. Conti, and L. Benini, "Robust real-time embedded EMG recognition framework using temporal convolutional networks on a multicore IoT processor," IEEE Trans. Biomed. Circuits Syst., vol. 14, no. 2, pp. 244–256, Apr. 2020.
[10] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for wireless resource management," in Proc. IEEE 18th Int. Workshop Signal Process. Adv. Wireless Commun. (SPAWC), Jul. 2017, pp. 1–6.
[11] U. Challita, L. Dong, and W. Saad, "Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective," 2017, arXiv:1702.07031. [Online]. Available: http://arxiv.org/abs/1702.07031
[12] O. Naparstek and K. Cohen, "Deep multi-user reinforcement learning for distributed dynamic spectrum access," IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 310–323, Jan. 2019.
[13] M. Eisen, C. Zhang, L. F. O. Chamon, D. D. Lee, and A. Ribeiro, "Learning optimal resource allocations in wireless systems," IEEE Trans. Signal Process., vol. 67, no. 10, pp. 2775–2790, May 2019.
[14] Y. S. Nasir and D. Guo, "Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks," 2018, arXiv:1808.00490. [Online]. Available: http://arxiv.org/abs/1808.00490
[15] W. Lee, M. Kim, and D.-H. Cho, "Deep power control: Transmit power control scheme based on convolutional neural network," IEEE Commun. Lett., vol. 22, no. 6, pp. 1276–1279, Jun. 2018.
[16] H. Ye and G. Y. Li, "Deep reinforcement learning for resource allocation in V2V communications," in Proc. IEEE Int. Conf. Commun. (ICC), May 2018, pp. 1–6.
[17] M. Yao, M. Sohul, V. Marojevic, and J. H. Reed, "Artificial intelligence defined 5G radio access networks," IEEE Commun. Mag., vol. 57, no. 3, pp. 14–20, Mar. 2019.
[18] E. Ghadimi, F. D. Calabrese, G. Peters, and P. Soldati, "A reinforcement learning approach to power control and rate adaptation in cellular networks," in Proc. IEEE Int. Conf. Commun. (ICC), May 2017, pp. 1–7.
[19] Y. Yu, T. Wang, and S. C. Liew, "Deep-reinforcement learning multiple access for heterogeneous wireless networks," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1277–1290, Jun. 2017.
[20] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, "Deep reinforcement learning for dynamic multichannel access in wireless networks," IEEE Trans. Cognit. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun. 2018.
[21] International Organization for Standardization/International Electrotechnical Commission, Information Technology—Open Systems Interconnection—Basic Reference Model: The Basic Model, Standard ISO/IEC 7498-1:1994, vol. 427, 1994. [Online]. Available: https://www.iso.org/standard/20269.html


[22] R. Andri, T. Henriksson, and L. Benini, "Extending the RISC-V ISA for efficient RNN-based 5G radio resource management," in Proc. 57th ACM/IEEE Design Autom. Conf. (DAC), Jul. 2020, pp. 1–6.
[23] M. Gautschi et al., "Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2700–2713, Oct. 2017.
[24] A. Pullini, D. Rossi, I. Loi, G. Tagliavini, and L. Benini, "Mr.Wolf: An energy-precision scalable parallel ultra low power SoC for IoT edge processing," IEEE J. Solid-State Circuits, vol. 54, no. 7, pp. 1970–1981, Jul. 2019.
[25] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 5998–6008.
[26] D. Lin, S. Talathi, and S. Annapureddy, "Fixed point quantization of deep convolutional networks," in Proc. Int. Conf. Mach. Learn., 2016, pp. 2849–2858.
[27] B. Jacob et al., "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2704–2713.
[28] G. L. Santos, P. T. Endo, D. Sadok, and J. Kelner, "When 5G meets deep learning: A systematic review," Algorithms, vol. 13, no. 9, p. 208, Sep. 2020.
[29] Marvell OCTEON TX2 DPDK Overview, Marvell, Hamilton, Bermuda, 2020.
[30] M. Yang, Y. Li, D. Jin, L. Su, S. Ma, and L. Zeng, "OpenRAN: A software-defined RAN architecture via virtualization," ACM SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 549–550, 2013.
[31] EdgeQ. Accessed: Feb. 24, 2021. [Online]. Available: www.edgeq.io
[32] OpenHW. Accessed: Feb. 24, 2021. [Online]. Available: www.openhwgroup.org
[33] Intel Corp. (2019). Intel-Architecture Instruction Set Extensions and Future Features Programming Reference. [Online]. Available: https://software.intel.com/en-us/download/intel-architecture-instruction-set-extensions-and-future-features-programming-reference
[34] J. Yiu, "Introduction to Armv8.1-M architecture," Arm, White Paper, Feb. 2019, pp. 1–14. [Online]. Available: https://pages.arm.com/rs/312-SAX-488/images/Introduction_to_Armv8.1-M_architecture.pdf
[35] N. Neves, N. Sebastiao, D. Matos, P. Tomas, P. Flores, and N. Roma, "Multicore SIMD ASIP for next-generation sequencing and alignment biochip platforms," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 7, pp. 1287–1300, Jul. 2015.
[36] X. Guan, Y. Fei, and H. Lin, "Hierarchical design of an application-specific instruction set processor for high-throughput and scalable FFT processing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 3, pp. 551–563, Mar. 2012.
[37] S. Shahabuddin, A. Mammela, M. Juntti, and O. Silven, "ASIP for 5G and beyond: Opportunities and vision," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 68, no. 3, pp. 851–857, Mar. 2021.
[38] L. Lai, N. Suda, and V. Chandra, "CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs," in Proc. Int. Conf. Hardw./Softw. Codesign Syst. Synth., 2018, pp. 1–2.
[39] A. Garofalo, M. Rusci, F. Conti, D. Rossi, and L. Benini, "PULP-NN: Accelerating quantized neural networks on parallel ultra-low-power RISC-V processors," Phil. Trans. Roy. Soc. A, Math., Phys. Eng. Sci., vol. 378, no. 2164, Feb. 2020, Art. no. 20190155.
[40] C.-W. Lin and J.-S. Wang, "A digital circuit design of hyperbolic tangent sigmoid function for neural networks," in Proc. IEEE Int. Symp. Circuits Syst., May 2008, pp. 856–859.
[41] K. Leboeuf, A. H. Namin, R. Muscedere, H. Wu, and M. Ahmadi, "High speed VLSI implementation of the hyperbolic tangent sigmoid function," in Proc. 3rd Int. Conf. Converg. Hybrid Inf. Technol., vol. 1, 2008, pp. 1070–1073.
[42] C.-H. Tsai, Y.-T. Chih, W. H. Wong, and C.-Y. Lee, "A hardware-efficient sigmoid function with adjustable precision for a neural network system," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 62, no. 11, pp. 1073–1077, Nov. 2015.
[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986.
[44] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[45] É. F. Zulian, G. Haugou, C. Weis, M. Jung, and N. Wehn, "System simulation with PULP virtual platform and SystemC," in Proc. Conf. Rapid Simulation Perform. Eval. Methods Tools, Jan. 2020, pp. 1–7.
[46] (2018). GreenWaves Technologies Unveils GAP8 Processor for AI at the Edge. [Online]. Available: https://venturebeat.com/
[47] S. Yin et al., "A high energy efficient reconfigurable hybrid neural network processor for deep learning applications," IEEE J. Solid-State Circuits, vol. 53, no. 4, pp. 968–982, Apr. 2018.
[48] Jetson AGX Xavier Developer Kit, Nvidia Corporation, Santa Clara, CA, USA, 2019, pp. 1–36. [Online]. Available: https://developer.nvidia.com/embedded/jetson-agx-xavier-developer-kit
[49] M. Cavalcante, F. Schuiki, F. Zaruba, M. Schaffner, and L. Benini, "Ara: A 1-GHz+ scalable and energy-efficient RISC-V vector processor with multiprecision floating-point support in 22-nm FD-SOI," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 2, pp. 530–543, Feb. 2020.
[50] A. Burrello, A. Garofalo, N. Bruschi, G. Tagliavini, D. Rossi, and F. Conti, "DORY: Automatic end-to-end deployment of real-world DNNs on low-cost IoT MCUs," Tech. Rep., Aug. 2020. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9381618

Gianna Paulin (Student Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical engineering and information technology from the Swiss Federal Institute of Technology, ETH Zürich (ETHZ), Zürich, Switzerland, in 2017 and 2019, respectively, where she is working toward the Ph.D. degree at the Integrated Systems Laboratory.
Her main interests lie in reduced-precision deep learning, from the algorithmic and the hardware acceleration aspect, with a focus on time series applications and low-power embedded systems.

Renzo Andri (Member, IEEE) received the B.Sc., M.Sc., and Ph.D. degrees in electrical engineering and information technology from ETH Zürich, Zürich, Switzerland, in 2013, 2015, and 2020, respectively.
He is currently a Senior Researcher with the Computing Systems Laboratory, Huawei Technologies, Zurich Research Center, Zürich. His research focuses on energy-efficient machine learning acceleration, from embedded system design to full-custom IC design.
Dr. Andri won the IEEE TCAD Donald O. Pederson Award in 2019.

Francesco Conti (Member, IEEE) received the Ph.D. degree in electronic engineering from the University of Bologna, Bologna, Italy, in 2016.
From 2016 to 2020, he was a Post-Doctoral Researcher with the Integrated Systems Laboratory, Digital Systems Group, ETH Zürich, Zürich, Switzerland. He is currently an Assistant Professor with the DEI Department, University of Bologna. He focuses on the development of deep learning-based intelligence on top of ultra-low-power, ultra-energy-efficient programmable Systems-on-Chip, from both the hardware and the software perspective.
Dr. Conti's work has resulted in more than 40 publications in international conferences and journals and has been awarded several times, including the 2020 IEEE TCAS-I Darlington Best Paper Award.

Luca Benini (Fellow, IEEE) has served as the Chief Architect for the Platform2012 in STMicroelectronics, Grenoble, France. He is currently the Chair of Digital Circuits and Systems with ETH Zürich, Zürich, Switzerland, and a Full Professor with the University of Bologna, Bologna, Italy. He is also active in the area of energy-efficient smart sensors and sensor networks. He has published more than 1000 articles in peer-reviewed international journals and conferences, four books, and several book chapters. His research interests are in energy-efficient system and multicore System-on-Chip (SoC) design.
Dr. Benini is also a Fellow of the ACM and a member of the Academia Europaea.

