RNN-Based Radio Resource Management on Multicore RISC-V Accelerator Architectures
Abstract—Radio resource management (RRM) is critical in mobile communications due to its ubiquity on every radio device and its low latency constraints. The rapidly evolving RRM algorithms with low latency requirements, combined with the dense and massive 5G base station deployment, ask for an on-the-edge RRM acceleration system with a tradeoff between flexibility, efficiency, and cost, making application-specific instruction-set processors (ASIPs) an optimal choice. In this work, we start from a baseline, simple RISC-V core and introduce instruction extensions coupled with software optimizations for maximizing the throughput of a selected set of recently proposed RRM algorithms based on models using multilayer perceptrons (MLPs) and recurrent neural networks (RNNs). Furthermore, we scale from a single-ASIP to a multi-ASIP acceleration system to further improve RRM throughput. For the single-ASIP system, we demonstrate an energy efficiency of 218 GMAC/s/W and a throughput of 566 MMAC/s, corresponding to an improvement of 10× and 10.6×, respectively, over the single-core system with a baseline RV32IMC core. For the multi-ASIP system, we analyze the parallel speedup dependency on the input and output feature map (FM) size for fully connected and LSTM layers, achieving up to 10.2× speedup with 16 cores over a single extended RI5CY core for single LSTM layers and a speedup of 13.8× for a single fully connected layer. On the full RRM benchmark suite, we achieve an average overall speedup of 16.4×, 25.2×, 31.9×, and 38.8× on two, four, eight, and 16 cores, respectively, compared to our single-core RV32IMC baseline implementation.

Index Terms—Application-specific instruction-set processor (ASIP), long short-term memory (LSTM), machine learning, neural networks, radio resource management (RRM), recurrent neural network (RNN), RISC-V.

I. INTRODUCTION

The 5G communication standard is a necessity for the advancement of the digital revolution. However, these advancements heavily tighten the already demanding requirements on the hardware (HW) and software (SW) layers used in radio communication, pushing industry and academia toward improving the efficiency of radio resource management (RRM) [1].

RRM typically runs on a radio access network (RAN) System-on-Chip (SoC) on every base station. The RAN optimizes various tasks, such as limited radio frequency communication spectrum utilization, transmission power control, error coding, and beamforming, within a very short time period. While doing so, various constraints need to be considered: the communication load needs to be appropriately balanced, every user device needs to be served fairly, the overall throughput should be high, and ideally, everything should be performed with high energy efficiency. Fig. 1 gives a high-level overview of some essential RRM tasks with the most common performance metrics [2]. Traditionally used optimization algorithms for RRM include exhaustive heuristic search methods, iterative algorithms [3], [4], nonlinear nonconvex optimization problems [3], game theory, or Lagrangian relaxations [5]. Recently, however, algorithms based on deep learning (DL) have revolutionized many time-series analysis problems, such as speech recognition [6], speech synthesis [7], automatic translation [8], and biosignal analysis [9]. Therefore, it is no surprise that research has started to tackle RRM using neural networks [10]–[20]. Compared to the aforementioned traditional iterative algorithms, DL-based models are capable
Fig. 1. Overview of various RRM tasks with their most common optimization constraints and metrics [2]. The resource categories are high-level and can
include many fine-grained tasks, often also targeting metrics from other task categories.
TABLE I
Benchmark Suite of Typical Models Used for RRM: LSTM RNNs and MLPs That Are Built From Multiple FCLs
Commonly used DL models include deep MLPs [10], [13], CNNs, and LSTM RNNs [11], [12]. While attention-based models, such as the transformer [25], have shown state-of-the-art results for all kinds of time-series predictions, to the best of our knowledge, this new network type has not yet been applied to RRM tasks.

In this work, we have selected a comprehensive set of DL-based RRM models as a benchmark suite (see Table I). All models are used in a deep reinforcement learning (DL-RL) setup: we have an agent that interacts with an environment. At every timestep, the agent decides, based on an observed state of the environment and according to a policy, what action to take, thereby putting the environment in a new state. Based on the new state, the agent receives feedback. The policy is typically implemented as a function approximation and, in our case, is implemented with MLP and/or LSTM models. Note that, in general, RL allows performing policy updates online based on the received feedback. However, none of the selected benchmarks makes use of this feature. Instead, the models are trained offline and are deployed on a base station (typically on a RAN SoC). Online training support increases the computational complexity from O(n_layers · n³) for the inference of an MLP with N_I = N_O = n to O(n_gradient_iterations · n_layers · n³). While this can get problematic for the highly latency-critical RRM applications, offline training still allows updating the models as needed periodically (e.g., once per week) [12].
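To make this deployment model concrete, the following minimal C sketch illustrates the inference-only agent loop just described; get_state(), policy_forward(), and apply_action() are hypothetical placeholders of ours, not interfaces from the benchmark suite.

    #include <stdint.h>

    /* Hypothetical environment/policy interface, for illustration only. */
    typedef struct { int16_t features[32]; } rrm_state_t;

    extern void get_state(rrm_state_t *s);            /* observe the environment    */
    extern int  policy_forward(const rrm_state_t *s); /* offline-trained MLP/LSTM   */
    extern void apply_action(int action);             /* e.g., select channel/power */

    void rrm_agent_loop(void)
    {
        rrm_state_t s;
        for (;;) {                           /* one iteration per RRM timestep      */
            get_state(&s);                   /* observed state of the environment   */
            int action = policy_forward(&s); /* policy: frozen NN, no online update */
            apply_action(action);            /* puts the environment in a new state */
            /* The received feedback is not used at run time: the models are
               retrained offline and redeployed periodically (e.g., weekly). */
        }
    }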
Furthermore, DNN-based policies are trained with algorithms such as gradient descent and backpropagation, which, in contrast to the inference part of DNNs, typically still require high-precision floating-point arithmetic. Inference, on the other hand, works well with low-precision fixed-point arithmetic, such as 16-bit [26], 8-bit, or even fewer bits, while keeping accuracy high [27]. As RRM mainly uses MLP and LSTM layers [28], the latter being well known for quite high sensitivity to numerical precision, we use for our implementation the rather conservative 16-bit integer format, which requires only simple or even no quantization-aware training methods.
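As an illustration of this arithmetic style, the following sketch shows a 16-bit fixed-point dot product with a 32-bit accumulator and saturating requantization; the concrete Q3.12 format (FRAC_BITS = 12) is our assumption, as the text above fixes only the 16-bit width.

    #include <stdint.h>

    #define FRAC_BITS 12   /* assumed Q3.12 format */

    static inline int16_t sat16(int32_t v)
    {
        if (v > INT16_MAX) return INT16_MAX;   /* saturate on overflow */
        if (v < INT16_MIN) return INT16_MIN;
        return (int16_t)v;
    }

    /* 16-bit fixed-point dot product with bias: products are widened to
       32 bit, accumulated, and requantized back to 16 bit once at the end. */
    int16_t dot16(const int16_t *a, const int16_t *b, int n, int16_t bias)
    {
        int32_t acc = (int32_t)bias << FRAC_BITS;   /* bias in extended precision */
        for (int i = 0; i < n; i++)
            acc += (int32_t)a[i] * (int32_t)b[i];   /* 16x16 -> 32-bit MAC        */
        return sat16(acc >> FRAC_BITS);             /* rescale to the Q-format    */
    }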
Models A and B both combine LSTMs and MLPs. Model A maximizes throughput by adapting dynamic channel selection, carrier aggregation, and spectrum access under fairness constraints [11]; model B focuses more on the dynamic spectrum access for network utility maximization [12]. Model C is also an MLP and minimizes interference under latency constraints via channel selections and power control [16]. Model D targets a multichannel access problem for throughput maximization [20]. Models E and F tackle the problem of interference channel power control and throughput maximization [10], [14]. Model G optimizes the sum throughput and α-fairness in time slot sharing among heterogeneous network nodes with different applied MAC protocols [19]. Model H optimizes throughput with a more general optimal resource allocation problem formulation. Finally, model I is the only CNN-based model. It aims at maximizing throughput by power control [15]. For more detailed information about the individual benchmarks, we refer the interested reader to the references of the corresponding models.

Currently, these RRM tasks are mainly executed on general-purpose processors on the 5G base stations [29]. The application-specific customization of these processors can cover the various needs of the RNN-based applications while still being flexible enough for adapting to rapidly evolving algorithms. In this work, we present the first, to the best of our knowledge, dedicated ASIP-based subsystem for running RNN-based RRM efficiently.

Up until now, all RAN SoCs are heavily proprietary and closed-source. These designs tend to be offered by a very limited number of vendors and typically require a complete replacement when upgrading to newer protocols and standards, as their HW and SW designs are heavily coupled. The virtualization coming with OpenRAN allows operators to run software-based network functions on standard commercial off-the-shelf (COTS) servers, allowing them to upgrade software code and hardware components more gradually [30]. General-purpose hardware running software-defined stacks, however, is often suboptimal from a power viewpoint. Recently, a startup called EdgeQ [31] has announced a developer-accessible RISC-V-based SoC with custom hardware instructions to accelerate algorithms used for 4G and 5G communication and signal processing. Our own approach is similar but takes a further step toward openness, relying on: (i) an open ISA; (ii) open-source cores and compilers; and (iii) an open-source architecture [32].

B. ISA Extensions for Domain Specialization

The idea of extending general-purpose cores with application-specific instructions is not new. ARM and Intel both offer various matrix computation and vector processing extensions in high-performance-oriented general-purpose processors. For example, ARM has introduced the AARCH64 Neon extensions with the ARMv8-A processor series, including SIMD instructions for sum-dot-products (e.g., BFDOT) and 2 × 2 matrix–matrix multiplications (e.g., BFMMLA) with two-way SIMD in the brain floating-point format bfloat16. Note that this processor series comes in various microarchitectures. For example, the CORTEX-A55 (ARMv8.2-A) has an
in-order superscalar microarchitecture, while, e.g., the CORTEX-A75 (ARMv8.2-A) incorporates an out-of-order superscalar pipeline. Intel's out-of-order Skylake SP processors extend the x86 ISA with 512-bit-wide vector units called AVX512, enabling 16 × 32-bit SIMD vector operations for multiplication in single-precision float (FP32) and accumulation in double-precision float (FP64). In 2019, Intel announced its new x86 microarchitecture called Cascade Lake, which introduces the AVX512 Vector Neural Network Instructions (VNNI). AVX512 VNNI implements an 8-bit and 16-bit fixed-point vector product with 32-bit internal accumulation [33]. While Intel focuses mostly on the high-performance, high-cost processor market, ARM also offers microcontrollers in the low-cost and low-power range with the Cortex-M family. Recently, ARM introduced the Cortex-M55, an ultra-low-power in-order microprocessor with the M-Profile Vector Extension MVE (Helium). The Helium instructions support various single instruction–multiple data (SIMD) instructions (INT8/16/32, FP16/32), hardware loops, and interleaved postincrement loads/stores [34]. The Helium extension shows that introducing custom ISA extensions is not only beneficial for high-performance general-purpose out-of-order cores and superscalar in-order cores but also for small, energy-efficient, in-order cores with an IPC close to one.

Several academic proposals also introduce specialized instructions: Neves et al. [35] introduce specialized SIMD instructions optimized for biological sequence alignment algorithms. Guan et al. [36] propose a custom fast Fourier transformation (FFT) instruction for increased throughput on orthogonal frequency-division multiplexing (OFDM)-based communication standards. Opportunities for ASIPs in 5G networks are surveyed in [37]. The authors also mention ASIPs for ML acceleration as a future direction in 5G applications. Our work takes a major step in this direction. We focus on RISC-V ISA extensions for two key reasons. First, RISC-V is designed for extensibility, with significant parts of the opcode space reserved for custom extensions. Second, with the growing community around the open and royalty-free RISC-V ISA, the number of high-quality RISC-V-based open-source cores and microcontroller systems has grown rapidly. Various open-source cores already support custom instructions; e.g., the RI5CY core from the parallel ultra-low-power (PULP) project supports custom xPULP instructions, such as SIMD, HW loops, and postincrement loads [23]. In this work, we use the open RISC-V ISA and start as a baseline with a multicore cluster system based on the mentioned open-source RI5CY core.

C. Software Optimizations

The rise of the previously described ISA extensions has led industry and academia to develop highly optimized SW kernels, such as matrix–vector and matrix–matrix multiplications, which make the best use of these extensions. The used techniques mainly include the utilization of parallel SIMD computations and data reuse within the local register file with appropriate tiling for reduced memory data loads. The latter has been commonly used for tiling the output FMs, where loaded IFMs can be reused to compute multiple outputs in parallel [38], [39]. CNNs can exploit the im2col concept of replicating and rearranging FMs to be formulated as a matrix–matrix multiplication problem [38], [39]. This well-researched reformulation enables the tiling of both the input and output FMs spatially in m × n-sized tiles, thereby enabling the reuse of both weights and input FM pixels, which ultimately reduces the number of memory loads from O(mn) to O(m + n). Since both the MLP and (nonconvolutional) LSTM layers are based on matrix–vector multiplications, this 2-D tiling cannot be reused.
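The output-FM tiling that does remain applicable to matrix–vector kernels can be sketched as follows (our illustration): two output rows are accumulated per pass, so each loaded input element is reused twice while both accumulators stay in the register file.

    #include <stdint.h>

    /* Output-FM tiling for y = W*x: two rows of W are processed per pass so
       that each loaded x[i] is reused for two outputs and both accumulators
       stay in registers (illustration; assumes n_o is even). */
    void matvec_tiled2(const int16_t *W, const int16_t *x, int32_t *y,
                       int n_i, int n_o)
    {
        for (int o = 0; o < n_o; o += 2) {
            const int16_t *w0 = &W[(o + 0) * n_i];
            const int16_t *w1 = &W[(o + 1) * n_i];
            int32_t acc0 = 0, acc1 = 0;
            for (int i = 0; i < n_i; i++) {
                int32_t xi = x[i];        /* one load, used for two output rows */
                acc0 += (int32_t)w0[i] * xi;
                acc1 += (int32_t)w1[i] * xi;
            }
            y[o + 0] = acc0;
            y[o + 1] = acc1;
        }
    }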
In contrast to CNNs, RNNs require the use of nonlinear activation functions, such as the hyperbolic tangent and sigmoid. As their transcendental computation is computationally complex, various acceleration approaches have been proposed: (i) piecewise linear approximation (PLA) [38]; (ii) low-order Taylor series expansion (e.g., second order [40]); (iii) lookup table (LUT) with adaptive value granularity [41]; and (iv) small neural networks [42]. For our extension, we apply the PLA approach and, unlike other works, exploit the symmetry property of tanh and sig. In addition, we go a step further than, e.g., ARM's CMSIS-NN library and evaluate the error introduced by different numbers of interpolation intervals, taking the applied fixed-point quantization into account [38].
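A minimal sketch of such a symmetric PLA tanh in 16-bit fixed point is shown below; the eight chord segments over [0, 4) and the Q3.12 coefficients are our own illustrative choices, not the interpolation parameters used in the actual extension.

    #include <stdint.h>

    /* Symmetric PLA tanh in Q3.12: only |x| is looked up, exploiting the odd
       symmetry tanh(-x) = -tanh(x), which halves the table. Eight chords of
       width 0.5 over [0, 4); inputs beyond 4.0 are clamped since tanh is ~1
       there. Coefficients are illustrative chord slopes/offsets of ours. */
    #define PLA_SEGS 8
    static const int16_t pla_slope[PLA_SEGS] =
        { 3786, 2454, 1176, 483, 185, 70, 25, 9 };
    static const int16_t pla_offset[PLA_SEGS] =
        { 0, 666, 1944, 2984, 3578, 3867, 4000, 4057 };

    int16_t tanh_pla_q12(int16_t x)
    {
        int32_t ax = (x < 0) ? -(int32_t)x : x;     /* exploit odd symmetry  */
        if (ax >= (4 << 12)) ax = (4 << 12) - 1;    /* clamp: tanh(x) ~ +/-1 */
        int seg = (int)(ax >> 11);                  /* segment of width 0.5  */
        int32_t y = (((int32_t)pla_slope[seg] * ax) >> 12) + pla_offset[seg];
        return (int16_t)((x < 0) ? -y : y);
    }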
III. ASIP-BASED RNN ACCELERATION

A. Baseline RISC-V Architecture

In this work, we propose an ASIP-based RNN RRM acceleration system. We start with designing a single-core ASIP, which we then integrate into a cluster with up to N = 16 ASIPs. As a starting point for the ASIP, we use PULP's RI5CY core, a four-stage pipeline processor supporting the RISC-V standard extended with custom instructions (i.e., the RV32IMFCXpulp ISA) [23]. We further extend the RI5CY core with specialized dot product, hyperbolic tangent, and sigmoid instructions for efficient MLP and LSTM execution [22]. We evaluate these custom instructions for throughput and energy efficiency on the selected RRM benchmark suite in an acceleration subsystem of up to 16 RNN-enhanced RI5CY cores. An overview of the proposed architecture and its integration into a state-of-the-art RAN SoC is shown in Fig. 3. In the right half, we show the proposed RRM acceleration cluster. We perform the evaluations on a cluster configuration with a single RNN ASIP, in which case the modules with a dashed border in Fig. 3 would be removed, and in a multicore configuration, where all shown modules are used. For the integration of the proposed ASIP-based acceleration subsystem into a large 5G RAN SoC, we propose to connect one of the proposed systems via a crossbar, e.g., by connecting it to the system crossbar in Marvell's Octeon TX2 CN98xx architecture [29], as shown on the left-hand side of Fig. 3.

The proposed RRM cluster is designed around a configurable number (up to 16) of enhanced RI5CY cores. The cluster has no data cache hierarchy in the traditional sense. Instead, all ASIP cores share a single-cycle accessible L1 tightly coupled data memory (TCDM) with a banking factor of 2, composed of word-interleaved single-port SRAM macros as memory banks. The programmer is responsible for ensuring that
Fig. 3. ASIP RNN subsystem overview, including its integration in a typical RAN SoC. The arrows in the RRM cluster point in the master-to-slave direction. The modules with a dashed border are only used in the multicore cluster configuration, while the modules with solid borders are used in both the single-core and multicore cluster configurations.
the correct data are loaded via DMA from the L2 memory before the ASIP cores try to access them. The required synchronization schemes are implemented by an event unit. We use three different cluster configurations for our evaluations, in which we have one, eight, or 16 ASIPs and a 1-MB, 512-kB, or 64-kB L1 memory. The L2 memory is mapped into the last-level cache (LLC) memory of the RAN SoC.

A DMA engine is connected to the L1 memory via the logarithmic interconnect to manage data transfers between L2 and the small L1 embedded within the RRM cluster. The DMA can be controlled by a single core from the cluster side and supports blocking and nonblocking 64-bit/cycle data transactions in both directions concurrently. The ASIPs fetch their instructions from a shared instruction cache. Multiport memory banks are used for the shared tag and instruction memory. The I-Cache has access to the 64-bit AXI cluster bus to fetch off-cluster data from the L2 in the case of a cache miss. The 64-bit AXI cluster bus also serves DMA data transfers from L2 to the L1 TCDM. Over a so-called peripheral interconnect, the ASIPs can control and program the DMA engine, the event unit, or further peripherals, e.g., timers.
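The nonblocking DMA transfers allow double buffering, i.e., overlapping the transfer of the next data tile with the computation of the current one. The following sketch illustrates the pattern; dma_start_in(), dma_wait(), and compute_tile() are hypothetical wrappers of ours (the real transfers are programmed via the peripheral interconnect and synchronized by the event unit).

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical DMA wrappers, for illustration only. */
    extern int  dma_start_in(void *l1_dst, const void *l2_src, size_t bytes); /* nonblocking */
    extern void dma_wait(int transfer_id);                                    /* blocking    */
    extern void compute_tile(const int16_t *weights_l1, int tile);

    /* Double-buffered weight streaming: while tile t is processed from one L1
       buffer, tile t+1 is transferred from L2 into the other buffer. */
    void run_layer(const int16_t *weights_l2, int n_tiles, size_t tile_bytes,
                   int16_t *buf[2])
    {
        int id = dma_start_in(buf[0], weights_l2, tile_bytes);
        for (int t = 0; t < n_tiles; t++) {
            dma_wait(id);                          /* tile t now resides in L1 */
            if (t + 1 < n_tiles)                   /* prefetch the next tile   */
                id = dma_start_in(buf[(t + 1) & 1],
                                  (const uint8_t *)weights_l2
                                      + (size_t)(t + 1) * tile_bytes,
                                  tile_bytes);
            compute_tile(buf[t & 1], t);           /* overlaps with the DMA    */
        }
    }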
B. Neural RRM Models

The selected set of benchmarks is based on two DL models: MLPs and LSTM RNNs. MLPs include at least three layers: one input, at least one hidden, and one output layer. These individual layers, also called fully connected layers (FCLs), are biased matrix–vector multiplications transforming N_X input features x_t at time t into N_O output activations y_t with the weight W_x ∈ ℝ^(N_O × N_X) and bias b_y ∈ ℝ^(N_O)

    y_t = W_x · x_t + b_y.    (1)
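A direct (unoptimized) C transcription of (1) in the 16-bit fixed-point format discussed above could look as follows; the row-major weight layout and the FRAC_BITS value are our assumptions.

    #include <stdint.h>

    #define FRAC_BITS 12   /* assumed fixed-point format, as before */

    /* Direct transcription of (1): y_t = W_x * x_t + b_y. W is stored
       row-major with one row of n_x weights per output activation. */
    void fcl(const int16_t *W, const int16_t *x, const int16_t *b,
             int16_t *y, int n_x, int n_o)
    {
        for (int o = 0; o < n_o; o++) {
            int32_t acc = (int32_t)b[o] << FRAC_BITS;    /* bias               */
            for (int i = 0; i < n_x; i++)
                acc += (int32_t)W[o * n_x + i] * x[i];   /* row o of W_x . x_t */
            y[o] = (int16_t)(acc >> FRAC_BITS);          /* requantize         */
        }
    }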
RNNs [43] are a linear superposition of multiple FCLs activated by a nonlinear activation function act (typically the hyperbolic tangent or sigmoid function) with a feedback over the hidden state h_t = (h_1, h_2, . . . , h_{N_H}) with N_H elements

    h_t = act(W_xh · x_t + W_hh · h_{t-1} + b_h).    (2)

RNN networks can contain multiple RNN layers by feeding the hidden state of one RNN layer as the input state to the next RNN layer. The final output of the network is computed from the hidden state of the last RNN layer

    y_t = act(W_hy · h_t + b_y).    (3)

LSTM neural networks [44] are a subclass of RNNs specialized in learning both short- and long-term time-series dependencies. Besides the hidden state h_t, LSTMs include an internal cell state c = (c_1, c_2, . . . , c_{N_H}). Their computation includes matrix–vector multiplications, pointwise vector–vector additions/multiplications, and pointwise applied sigmoid and hyperbolic tangent activation functions

    i_t  = σ(W_xi · x_t + W_hi · h_{t-1} + b_i)       (4)
    f_t  = σ(W_xf · x_t + W_hf · h_{t-1} + b_f)       (5)
    c̃_t  = tanh(W_xc · x_t + W_hc · h_{t-1} + b_c)    (6)
    c_t  = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t                  (7)
    o_t  = σ(W_xo · x_t + W_ho · h_{t-1} + b_o)       (8)
    h_t  = o_t ⊙ tanh(c_t)                            (9)

including the input gate i, forget gate f, output gate o, and the cell state c.
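For illustration, one timestep of (4)–(9) can be sketched in C by reusing the fcl() routine above for the eight matrix–vector products and PLA activations for σ and tanh; the gate ordering, scratch layout, and omission of saturation are simplifications of ours, not the optimized kernel.

    #include <stdint.h>

    #define FRAC_BITS 12

    static inline int16_t mul_q(int16_t a, int16_t b)     /* pointwise Q-mult */
    { return (int16_t)(((int32_t)a * (int32_t)b) >> FRAC_BITS); }

    /* Assumed available from the earlier sketches. */
    extern int16_t tanh_pla_q12(int16_t x);
    extern int16_t sig_pla_q12(int16_t x);
    extern void fcl(const int16_t *W, const int16_t *x, const int16_t *b,
                    int16_t *y, int n_x, int n_o);

    /* One LSTM timestep per (4)-(9); gate order k: 0=i, 1=f, 2=c~, 3=o.
       h holds h_{t-1} on entry and h_t on exit; c is updated in place.
       tmp must hold 3*n_h values; zb is an all-zero bias vector. */
    void lstm_step(const int16_t *Wx[4], const int16_t *Wh[4],
                   const int16_t *b[4], const int16_t *zb,
                   const int16_t *x, int16_t *h, int16_t *c,
                   int16_t *tmp, int n_x, int n_h)
    {
        int16_t *u = tmp, *v = tmp + n_h, *i_gate = tmp + 2 * n_h;

        for (int k = 0; k < 4; k++) {
            fcl(Wx[k], x, b[k], u, n_x, n_h);    /* W_x* . x_t + b   */
            fcl(Wh[k], h, zb,  v, n_h, n_h);     /* W_h* . h_{t-1}   */
            for (int j = 0; j < n_h; j++) {
                int16_t a = (int16_t)(u[j] + v[j]);        /* pre-activation */
                switch (k) {
                case 0: i_gate[j] = sig_pla_q12(a); break;            /* (4)      */
                case 1: c[j] = mul_q(sig_pla_q12(a), c[j]); break;    /* (5), (7) */
                case 2: c[j] = (int16_t)(c[j]
                             + mul_q(i_gate[j], tanh_pla_q12(a)));
                        break;                                        /* (6), (7) */
                case 3: h[j] = mul_q(sig_pla_q12(a),
                                     tanh_pla_q12(c[j])); break;      /* (8), (9) */
                }
            }
        }
    }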
C. Enhanced RISC-V ISA

In this section, we summarize our custom ISA extensions used in our ASIP cluster. For more details, we refer the interested reader to [22].

1) xPULP Extensions: The baseline RI5CY core already supports various specialized HW instructions, such as SIMD, HW loops, and postincrement loads under the name xPULP, which we leverage in the first optimization step.

2) Tanh and Sigmoid Extension (HW): The two nonlinear activation functions used for neural networks, such as LSTM networks, are the sigmoid sig and the hyperbolic tangent tanh. Their execution is often emulated in software with the help of a linear approximation technique requiring multiple iterations until the required precision is reached. This emulation can quickly become a major contributor to the overall execution time of LSTM networks as, for example, Challita et al. [11]
Fig. 4. RNN RISC-V core with extensions to the RI5CY core [23] in blue and the datapath for the pl.sdotsp instruction marked in bold lines.
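To illustrate how such an extension can be used from C, the following sketch binds a SIMD sum-dot-product instruction via inline assembly; the mnemonic, operand order, and pure register-based semantics are assumptions of ours for illustration (the actual pl.sdotsp instruction described in [22] additionally streams one operand from memory with postincrement).

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical C binding of a fused SIMD sum-dot-product instruction:
       assumed semantics acc += a.lo*b.lo + a.hi*b.hi on packed 2x16-bit
       operands. Mnemonic and operand order are assumptions, not a verified
       encoding. */
    static inline int32_t sdotsp2(int32_t acc, uint32_t a2, uint32_t b2)
    {
        __asm__ volatile("pl.sdotsp.h %0, %1, %2"
                         : "+r"(acc)
                         : "r"(a2), "r"(b2));
        return acc;
    }

    /* Dot product over n (even) int16 elements, two MACs per instruction. */
    int32_t dot_sdotsp(const int16_t *a, const int16_t *b, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i += 2) {
            uint32_t a2, b2;
            memcpy(&a2, &a[i], 4);     /* pack two 16-bit operands */
            memcpy(&b2, &b[i], 4);
            acc = sdotsp2(acc, a2, b2);
        }
        return acc;
    }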
TABLE III
Cycle and Instruction Count for the Entire RRM Benchmark Suite (RI5CY in Bold and New Extensions in Blue). (a) w/o Opt. (RV32IMC). (b) +SIMD/HWL (xPULP). (c) +Out-FM Tiling/tanh/sig. (d) +pl.sdotsp Instruction. (e) +Input FM Tiling
Fig. 8. Speedup with respect to the RISC-V IMC baseline implementation for a typical neural network workload in RRM.
Fig. 9. Speedup on the evaluation cluster of a single FCL or LSTM layer for various IFM and OFM sizes. (a) FCL: speedup for two cores. (b) FCL:
speedup for four cores. (c) FCL: speedup for eight cores. (d) FCL: speedup for 16 cores. (e) LSTM: speedup for two cores. (f) LSTM: speedup for four
cores. (g) LSTM: speedup for eight cores. (h) LSTM: speedup for 16 cores.
TABLE V
Achieved Op/Cycle for the Larger Models on the Large Cluster Configuration and the Small Cluster Configuration With and Without Batching. The Relative Difference Is Compared to the Achieved Operations/Cycle on the Large Cluster Configuration
When only two cores [see Fig. 9(a)] or four cores [see Fig. 9(b)] in the cluster are enabled (while the others are shut off), the achieved speedup on the FCL kernel saturates already for rather small N_I and/or N_O toward its upper limit of SU_ideal = N_cores. In contrast, when eight or 16 cores are active, a higher output FM size N_O than input FM size N_I is needed for the achieved speedup to incline toward the ideal speedup. These observations are aligned with the Saturation and the N_I-independency properties of our simplified model in Section IV. The latter property shows itself in the smooth increase of SU_measured in the N_I-direction. The N_O-dependency of the stated Saturation is visible in all plots; however, the Optimality property for the N_O-dependency can be observed better for eight or 16 active cores [see Fig. 9(c) and (d)]. Over the complete sweep of the FCL kernel, the highest speedups are 2×, 4×, 7.7×, and 13.8× for, respectively, two, four, eight, and 16 cores.
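This N_O-dependency follows directly from how the work is split: each core computes a chunk of output neurons, so all cores are only fully utilized once N_O is large enough. A minimal sketch of such an output-channel partitioning is given below; core_id(), num_cores(), and barrier() stand in for the cluster runtime primitives (the actual synchronization uses the event unit).

    #include <stdint.h>

    /* Hypothetical cluster runtime primitives, for illustration only. */
    extern int  core_id(void);
    extern int  num_cores(void);
    extern void barrier(void);
    /* fcl() as sketched in Section III-B. */
    extern void fcl(const int16_t *W, const int16_t *x, const int16_t *b,
                    int16_t *y, int n_x, int n_o);

    /* Each core computes a contiguous chunk of output neurons; with small
       n_o, some cores idle, which is why the measured speedup tracks the
       output FM size. */
    void fcl_parallel(const int16_t *W, const int16_t *x, const int16_t *b,
                      int16_t *y, int n_i, int n_o)
    {
        int chunk = (n_o + num_cores() - 1) / num_cores();
        int first = core_id() * chunk;
        int last  = first + chunk < n_o ? first + chunk : n_o;
        if (first < last)
            fcl(&W[first * n_i], x, &b[first], &y[first], n_i, last - first);
        barrier();   /* all outputs ready before the next layer starts */
    }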
The same properties can again be observed in Fig. 9(e)–(h) for the sweep of the LSTM kernel over the IFM and OFM sizes N_I, N_O ∈ {2, 34, . . . , 162}. The highest used FM size for the LSTM sweep is smaller than that used for the FCL sweep since an LSTM kernel is larger than an FCL kernel and, therefore, saturates the L1 memory of the evaluation cluster configuration faster. Over the complete sweep of the LSTM
Fig. 10. Speedup when using multiple cores for a typical neural network workload in RRM when the whole model fits into L1. The single-core speedup corresponds to the speedup achieved with all single-core optimizations applied.
Fig. 12. Layerwise breakdown of the computation time, including the DMA transaction time, for model E [10]. (a) Model E with tiling. (b) Model E with tiling and batching.
TABLE VI
Comparison With Related Work

model), the compute throughput of 11 TOp/s is disproportionately large. The vector processor supports floating-point operations, offering more numerical precision than needed by our applications, resulting in an increased area cost and lower energy efficiency than our RRM ASIP. The hardwired accelerator provides the highest energy efficiency; however, hardwired accelerators are not flexible enough to adapt to the rapidly changing RRM field, as new algorithms would require a costly HW redesign [37]. Even though our implementation focuses on 16 bits instead of 8 bits, on a single-ASIP system, we achieve >2× higher Op/cycle than a comparable commercially available dual-issue core (STM32H743). Adapting our HW and SW optimizations for 8-bit could be promising for less arithmetic-sensitive applications. Overall, our customized ISA instructions are tailored to the needs of the targeted LSTM and MLP RRM models while maintaining flexibility. In addition, the number of compute units is adequately scaled to the average benchmark size.

VI. CONCLUSION

In this work, we have identified a selected set of recently proposed real-world RRM-targeted benchmarks based on MLPs and RNNs. Starting from a baseline, simple RISC-V core, we first introduce instruction extensions coupled with software optimizations for the selected RRM benchmarks and evaluate them on a single-core and multicore cluster acceleration system. For the single-core acceleration system, we demonstrate an energy efficiency of 218 GMAC/s/W and a throughput of 566 MMAC/s, corresponding to an improvement of 10× and 10.6×, respectively, over the single-core system with a baseline RV32IMC core. For the multicore acceleration system, we analyze the parallel speedup dependency on the input and output FM size for FCLs and LSTM layers, achieving up to 10.2× speedup with 16 cores over a single extended RI5CY core for single LSTM layers and a speedup of 13.8× for a single FCL. On the full RRM benchmark suite, we achieve an average overall speedup of 16.4×, 25.2×, 31.9×, and 38.8× on two, four, eight, and 16 cores, respectively, compared to our single-core RV32IMC baseline implementation.

ACKNOWLEDGMENT

The authors thank Matteo Spallanzani for the valuable discussions.

REFERENCES

[1] N. D. Tripathi, J. H. Reed, and H. F. VanLandingham, Radio Resource Management in Cellular Systems, vol. 618. Springer, 2006.
[2] S. Manap, K. Dimyati, M. N. Hindia, M. S. A. Talip, and R. Tafazolli, "Survey of radio resource management in 5G heterogeneous networks," IEEE Access, vol. 8, pp. 131202–131223, 2020.
[3] M. Naeem, K. Illanko, A. Karmokar, A. Anpalagan, and M. Jaseemuddin, "Optimal power allocation for green cognitive radio: Fractional programming approach," IET Commun., vol. 7, no. 12, pp. 1279–1286, Aug. 2013.
[4] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, "An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2011, pp. 3060–3063.
[5] K. I. Ahmed, H. Tabassum, and E. Hossain, "Deep learning for radio resource allocation in multi-cell networks," IEEE Netw., vol. 33, no. 6, pp. 188–195, Nov. 2019.
[6] A. Hannun et al., "Deep speech: Scaling up end-to-end speech recognition," 2014, arXiv:1412.5567.
[7] H. Ze, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., May 2013, pp. 7962–7966.
[8] Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," 2016, arXiv:1609.08144.
[9] M. Zanghieri, S. Benatti, A. Burrello, V. Kartsch, F. Conti, and L. Benini, "Robust real-time embedded EMG recognition framework using temporal convolutional networks on a multicore IoT processor," IEEE Trans. Biomed. Circuits Syst., vol. 14, no. 2, pp. 244–256, Apr. 2020.
[10] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for wireless resource management," in Proc. IEEE 18th Int. Workshop Signal Process. Adv. Wireless Commun. (SPAWC), Jul. 2017, pp. 1–6.
[11] U. Challita, L. Dong, and W. Saad, "Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective," 2017, arXiv:1702.07031.
[12] O. Naparstek and K. Cohen, "Deep multi-user reinforcement learning for distributed dynamic spectrum access," IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 310–323, Jan. 2019.
[13] M. Eisen, C. Zhang, L. F. O. Chamon, D. D. Lee, and A. Ribeiro, "Learning optimal resource allocations in wireless systems," IEEE Trans. Signal Process., vol. 67, no. 10, pp. 2775–2790, May 2019.
[14] Y. S. Nasir and D. Guo, "Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks," 2018, arXiv:1808.00490.
[15] W. Lee, M. Kim, and D.-H. Cho, "Deep power control: Transmit power control scheme based on convolutional neural network," IEEE Commun. Lett., vol. 22, no. 6, pp. 1276–1279, Jun. 2018.
[16] H. Ye and G. Y. Li, "Deep reinforcement learning for resource allocation in V2V communications," in Proc. IEEE Int. Conf. Commun. (ICC), May 2018, pp. 1–6.
[17] M. Yao, M. Sohul, V. Marojevic, and J. H. Reed, "Artificial intelligence defined 5G radio access networks," IEEE Commun. Mag., vol. 57, no. 3, pp. 14–20, Mar. 2019.
[18] E. Ghadimi, F. D. Calabrese, G. Peters, and P. Soldati, "A reinforcement learning approach to power control and rate adaptation in cellular networks," in Proc. IEEE Int. Conf. Commun. (ICC), May 2017, pp. 1–7.
[19] Y. Yu, T. Wang, and S. C. Liew, "Deep-reinforcement learning multiple access for heterogeneous wireless networks," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1277–1290, Jun. 2019.
[20] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, "Deep reinforcement learning for dynamic multichannel access in wireless networks," IEEE Trans. Cognit. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun. 2018.
[21] Information Technology—Open Systems Interconnection—Basic Reference Model: The Basic Model, Standard ISO/IEC 7498-1:1994, International Organization for Standardization/International Electrotechnical Commission, 1994.
[22] R. Andri, T. Henriksson, and L. Benini, "Extending the RISC-V ISA for efficient RNN-based 5G radio resource management," in Proc. 57th ACM/IEEE Design Autom. Conf. (DAC), Jul. 2020, pp. 1–6.
[23] M. Gautschi et al., "Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2700–2713, Oct. 2017.
[24] A. Pullini, D. Rossi, I. Loi, G. Tagliavini, and L. Benini, "Mr.Wolf: An energy-precision scalable parallel ultra low power SoC for IoT edge processing," IEEE J. Solid-State Circuits, vol. 54, no. 7, pp. 1970–1981, Jul. 2019.
[25] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 5998–6008.
[26] D. Lin, S. Talathi, and S. Annapureddy, "Fixed point quantization of deep convolutional networks," in Proc. Int. Conf. Mach. Learn., 2016, pp. 2849–2858.
[27] B. Jacob et al., "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2704–2713.
[28] G. L. Santos, P. T. Endo, D. Sadok, and J. Kelner, "When 5G meets deep learning: A systematic review," Algorithms, vol. 13, no. 9, p. 208, Sep. 2020.
[29] Marvell OCTEON TX2 DPDK Overview, Marvell, Hamilton, Bermuda, 2020.
[30] M. Yang, Y. Li, D. Jin, L. Su, S. Ma, and L. Zeng, "OpenRAN: A software-defined RAN architecture via virtualization," ACM SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 549–550, 2013.
[31] EdgeQ. Accessed: Feb. 24, 2021. [Online]. Available: www.edgeq.io
[32] OpenHW Group. Accessed: Feb. 24, 2021. [Online]. Available: www.openhwgroup.org
[33] Intel Corp. (2019). Intel Architecture Instruction Set Extensions and Future Features Programming Reference. [Online]. Available: https://software.intel.com/en-us/download/intel-architecture-instruction-set-extensions-and-future-features-programming-reference
[34] J. Yiu, "Introduction to Armv8.1-M architecture," ARM, White Paper, Feb. 2019, pp. 1–14. [Online]. Available: https://pages.arm.com/rs/312-SAX-488/images/Introduction_to_Armv8.1-M_architecture.pdf
[35] N. Neves, N. Sebastiao, D. Matos, P. Tomas, P. Flores, and N. Roma, "Multicore SIMD ASIP for next-generation sequencing and alignment biochip platforms," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 7, pp. 1287–1300, Jul. 2015.
[36] X. Guan, Y. Fei, and H. Lin, "Hierarchical design of an application-specific instruction set processor for high-throughput and scalable FFT processing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 3, pp. 551–563, Mar. 2012.
[37] S. Shahabuddin, A. Mammela, M. Juntti, and O. Silven, "ASIP for 5G and beyond: Opportunities and vision," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 68, no. 3, pp. 851–857, Mar. 2021.
[38] L. Lai, N. Suda, and V. Chandra, "CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs," in Proc. Int. Conf. Hardw./Softw. Codesign Syst. Synth., 2018, pp. 1–2.
[39] A. Garofalo, M. Rusci, F. Conti, D. Rossi, and L. Benini, "PULP-NN: Accelerating quantized neural networks on parallel ultra-low-power RISC-V processors," Phil. Trans. Roy. Soc. A, Math., Phys. Eng. Sci., vol. 378, no. 2164, Feb. 2020, Art. no. 20190155.
[40] C.-W. Lin and J.-S. Wang, "A digital circuit design of hyperbolic tangent sigmoid function for neural networks," in Proc. IEEE Int. Symp. Circuits Syst., May 2008, pp. 856–859.
[41] K. Leboeuf, A. H. Namin, R. Muscedere, H. Wu, and M. Ahmadi, "High speed VLSI implementation of the hyperbolic tangent sigmoid function," in Proc. 3rd Int. Conf. Converg. Hybrid Inf. Technol., vol. 1, 2008, pp. 1070–1073.
[42] C.-H. Tsai, Y.-T. Chih, W. H. Wong, and C.-Y. Lee, "A hardware-efficient sigmoid function with adjustable precision for a neural network system," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 62, no. 11, pp. 1073–1077, Nov. 2015.
[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986.
[44] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[45] É. F. Zulian, G. Haugou, C. Weis, M. Jung, and N. Wehn, "System simulation with PULP virtual platform and SystemC," in Proc. Conf. Rapid Simulation Perform. Eval. Methods Tools, Jan. 2020, pp. 1–7.
[46] (2018). GreenWaves Technologies Unveils GAP8 Processor for AI at the Edge. [Online]. Available: https://venturebeat.com/
[47] S. Yin et al., "A high energy efficient reconfigurable hybrid neural network processor for deep learning applications," IEEE J. Solid-State Circuits, vol. 53, no. 4, pp. 968–982, Apr. 2018.
[48] Jetson AGX Xavier Developer Kit, Nvidia Corporation, Santa Clara, CA, USA, 2019, pp. 1–36. [Online]. Available: https://developer.nvidia.com/embedded/jetson-agx-xavier-developer-kit
[49] M. Cavalcante, F. Schuiki, F. Zaruba, M. Schaffner, and L. Benini, "Ara: A 1-GHz+ scalable and energy-efficient RISC-V vector processor with multiprecision floating-point support in 22-nm FD-SOI," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 2, pp. 530–543, Feb. 2020.
[50] A. Burrello, A. Garofalo, N. Bruschi, G. Tagliavini, D. Rossi, and F. Conti, "DORY: Automatic end-to-end deployment of real-world DNNs on low-cost IoT MCUs," Tech. Rep., Aug. 2020. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9381618

Gianna Paulin (Student Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical engineering and information technology from the Swiss Federal Institute of Technology, ETH Zürich (ETHZ), Zürich, Switzerland, in 2017 and 2019, respectively, where she is working toward the Ph.D. degree at the Integrated Systems Laboratory.
Her main interests lie in reduced-precision deep learning from the algorithmic and hardware acceleration aspects, with a focus on time-series applications and low-power embedded systems.

Renzo Andri (Member, IEEE) received the B.Sc., M.Sc., and Ph.D. degrees in electrical engineering and information technology from ETH Zürich, Zürich, Switzerland, in 2013, 2015, and 2020, respectively.
He is currently a Senior Researcher with the Computing Systems Laboratory, Huawei Technologies, Zurich Research Center, Zürich. His research focuses on energy-efficient machine learning acceleration from embedded system design to full-custom IC design.
Dr. Andri won the IEEE TCAD Donald O. Pederson Award in 2019.

Francesco Conti (Member, IEEE) received the Ph.D. degree in electronic engineering from the University of Bologna, Bologna, Italy, in 2016.
From 2016 to 2020, he was a Post-Doctoral Researcher with the Integrated Systems Laboratory, Digital Systems Group, ETH Zürich, Zürich, Switzerland. He is currently an Assistant Professor with the DEI Department, University of Bologna. He focuses on the development of deep learning-based intelligence on top of ultra-low-power, ultra-energy-efficient programmable Systems-on-Chip, from both the hardware and software perspectives.
Dr. Conti's work has resulted in more than 40 publications in international conferences and journals and has been awarded several times, including the 2020 IEEE TCAS-I Darlington Best Paper Award.

Luca Benini (Fellow, IEEE) has served as the Chief Architect for the Platform2012 in STMicroelectronics, Grenoble, France. He is currently the Chair of Digital Circuits and Systems with ETH Zürich, Zürich, Switzerland, and a Full Professor with the University of Bologna, Bologna, Italy. He is also active in the area of energy-efficient smart sensors and sensor networks. He has published more than 1000 articles in peer-reviewed international journals and conferences, four books, and several book chapters. His research interests are in energy-efficient system and multicore System-on-Chip (SoC) design.
Dr. Benini is also a Fellow of the ACM and a member of the Academia Europaea.