Di Mascio Et Al 2021 On Board Decision Making in Space With Deep Neural Networks and Risc V Vector Processors
Di Mascio Et Al 2021 On Board Decision Making in Space With Deep Neural Networks and Risc V Vector Processors
readily available, is more difficult and thus still rare. This paper analyzes the impact of DNNs on the system-level
capabilities of space systems in terms of on-board decision making (OBDM) and identifies the specific criticalities of
deploying DNNs on satellites. The workload of DNNs for on-board image and telemetry analysis is analyzed, and the
results are used to drive the preliminary design of a RISC-V vector processor to be employed as a generic platform to
enable energy-efficient OBDM for both payload and platform applications. The design of the memory subsystem is
carried out in detail to allow full exploitation of the computational resources in typically resource-constrained space
systems.
during the Very High Speed Integrated Circuit Hardware Description small satellite in LEO to generate data and its capability to transmit
Language (VHDL) implementation of a RISC-V vector processor for data to the ground.
space applications based on the NOEL-V platform (developed by
Cobham Gaisler) [14]. 1. Benefits of Data Removal and Compression
RISC-V is an instruction set architecture (ISA) that is rapidly
growing in popularity in both terrestrial and space applications [4]. Given the tight power budgets and the expensive hardware
Its main characteristics are simplicity, openness (being a free and required for on-board data processing, data processing is typically
open standard allows open-source implementations), and modularity executed on ground. For instance, noise filtering can be executed on
(i.e., composed of a base ISA and many optional ISA extensions). ground with cheaper hardware. On the other hand, sometimes on-
Among the many ISA extensions defined in the standard, the RISC-V board data processing provides an advantage over on-ground data
Vector Extension (RVVE) is being proposed to provide general processing in terms of satellite performance. For instance, data
support for data-parallel execution [15]. compression is already deployed in many missions (e.g., in [22] a
The paper starts by analyzing the benefits that DNNs can provide at 2:1 compression is employed) because it mitigates the bottleneck of
the system level and the feasibility of the deep learning approach for the downlink. The efficiency of the downlink can be increased even
space applications (Sec. II). Then, an analysis of the software work- further, removing useless data instead of sending it to the ground
loads required for DNNs is carried out in Sec. III. The information (i.e., data removal [23]). For instance, in the Landsat datasets [24], the
collected is then used to define a suitable hardware platform (Secs. IV average cloud cover in an archived scene is 34%, with 38% of the
and V). To account for both computational and memory constraints, scenes containing less than 10% cloud cover. Therefore, selecting
separate discussions are carried out for the microarchitecture of the only images with less than 10% of cloud cover results in an average of
processing core (Sec. IV) and its memory subsystem (Sec. V). 2.63× data reduction. Combining data removal with a 2:1 compres-
Finally, Sec. VI concludes with a summary of the main findings sion, the amount of useful data sent increases by 5.26× compared
and several recommendations to systematically enable OBDM with with a system without on-board data processing.
RISC-V vector processors in the medium-term.
2. Cost of Required Hardware
When DNNs and other data processing algorithms are to be
II. Impact at System Level deployed on data produced by instruments, a payload processor is
The focus of the space industry in recent years shifted from large required to process the data. Although memories with long retention
geostationary orbit (GEO) satellites to small (< 500 kg) low-Earth- time and low power dissipation (e.g., flash memories) can be
orbit (LEO) satellites (especially CubeSats) [16,17]. employed for mass memories, faster memories are required to act
While GEO satellites can continuously communicate with the as main memory of the payload processors. Typically dynamic
ground station, LEO satellites can only communicate with the ground random-access memory (DRAM) arrays are chosen, ranging from
station periodically, sometimes with large periods between contacts single data rate (SDR) to double data rate 2 (DDR2) to double data
[18]. In this way, the satellite may enter an unsafe state and the ground rate 3 (DDR3), depending on the radiation resilience/performance
operator in the worst case can only intervene hours later. tradeoff required [25]. From the datasheet [26] of the 1 Gb DDR2
However, there is a trend of launching LEO satellites in constella- DRAM tested in [25], a peak power consumption of around 0.5 W
tions and mega constellations [19], with the possibility of mitigating can be taken as an estimation of power consumption, and 1 W for the
the risk of failure of a single satellite and replace them if they fail most powerful version of the vector processor in [27] running a
(as they are much cheaper than large GEO satellites). There is there- peak-performance application. Assuming a requirement of 1 GiB
fore a tradeoff to be made between dependability of a single satellite, of main memory, we consider 5 W as the cost in terms of power PP of
its cost, and number of spare satellites. applying data reduction and compression. As a comparison, 1U
Furthermore, space systems are inherently constrained in terms of CubeSats and 3U CubeSats in [20] generate, respectively, 1–2 W
power available (e.g., only a limited surface is available to collect and 5–6 W, whereas the 6U CubeSat in [22] generates around 20 W.
power). Limited power implies that the data rate of the downlink Assuming a common amount of power allocated for the trans-
given a certain target bit error rate (BER) is also limited, as the data mission and data processing subsystems (PTP ), we can estimate
rate is proportional to the power employed during the transmis- the amount of useful data transmitted per station contact DC when
sion [20]. data are not processed as DC PTP ∕RR k, where k is a
Therefore, small satellites in LEO pose new challenges both in constant (dependent on the transmission subsystem, receiver,
terms of amount of data that can be transmitted to the ground and in propagation, and required BER) [20] and RR is the optimal
terms of dependability. In the following two subsections we will show removal rate, i.e., the ratio between useful data and data produced
how OBDM can help mitigating these shortcomings of LEO satel- by the payload. When only useful data are selected and a data
lites. In Sec. II.C the feasibility of applying DNNs to these problems compression of CR:1 is applied, the amount of useful data trans-
is investigated. mitted is instead DC CR PTP − PP k. The ratio R between
DI MASCIO ET AL. 555
B. Virtual Operator
In [21] it is assumed that a LEO satellite has an orbit duration of
90 min and that there is a contact with the ground station either 5 min
every orbit (6% of the orbital period) or every 5 orbits (1%). In a
similar scenario, the idea of an on-board virtual operator monitoring
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
the past telemetry to train the network on ground and then uplink the public code††† of CloudNet [34]. It is a fully convolutional network
trained network in software. When the telemetry forecasting is to be (FCN) [35] for cloud detection; i.e., its output is a mask of the same
deployed on a constellation composed of replicas of the same size of the input image indicating the pixels covered with clouds. The
satellite, more statistics for larger datasets are available. As reported use of an FCN instead of a CNN helps in mapping efficiently the
in [19], existing and planned constellations comprise hundreds to DNN in resource-constrained hardware, as it is possible to work on
thousands satellites (e.g., 4200 for the planned constellation from patches of a large image without the need of working on the entire
Samsung), thus making DNNs potentially very effective also for image. The fraction of bits covered in clouds can then be averaged on
mission-specific parameters. the ∼400 patches. In the case of CloudNet four spectral bands of the
large images of Landsat 8 (e.g., 7621 × 7791 pixels) are divided in
nonoverlapping patches of 384 × 384 pixels, which are then down-
III. Workloads Analysis sampled to 192 × 192 pixels.
Analyzing the model in Keras,‡‡‡ we find that CloudNet contains
The run time of compute-intensive workloads composed of a 38 two-dimensional convolutional layers (of which 5 are transposed),
certain amount of floating point calculations is typically expressed 15 addition layers, 31 batch normalization layers, 45 standalone
in terms of number of floating point operations (FLOPs) per second activation layers, and 53 concatenate layers. To give an idea of the
(FLOP/s) or number of FLOPs per clock cycle (FLOP/CC).*** The contribution of each of these layers, we profiled the execution of the
number of FLOP/CC that can be achieved by a certain hardware model on a quad-core Intel i7-6600U. The breakdown of the execu-
platform has an upper bound defined by the number of functional tion type for each type of layer is shown in Fig. 3, and considerations
units and the amount of operations these units can perform simulta- on each of them are carried out in the remainder of this section.
neously. We call this upper bound maximum theoretical performance Furthermore, running a single inference per time requires a peak
per clock cycle (MTPCC ). MTPCC is independent of any other memory of 836.65 MiB. This value is compatible with values found
microarchitectural feature, like instruction-level parallelism (ILP), in literature for other DNNs, typically ranging from 645 MiB to 1.49
speculation, and caching. However, it is not possible to achieve GiB [36].
#FLOP∕CC ≈ MTPCC for every workload, as data are to be fetched
from memory, and in some cases this cannot be done fast enough to
1. Convolutional Layers
keep the functional units busy all the time. To visualize whether a
workload can achieve the MTPCC (compute-bounded workloads) or As shown in Fig. 4, applying a convolutional layer with N kernels,
the performances are bound from the memory bandwidth (memory- each of dimensions C × J × K, kernels to an input of dimensions C ×
bounded workloads), the roofline model was introduced in [32]. W × H generates an output of N matrices, each of dimensions U × V
According to this model, the fraction of MTPCC that can be achieved [37], with U and V depending on the stride S and padding P of the
by a workload on a certain platform depends on the operational convolutional layer with the equation (an analogous relationship
intensity (OI) of the workload, which is holds replacing W, J, and U with, respectively, H, K, and V) [38]:
where MT is the memory traffic composed of the read traffic RT plus As straightforward software implementations of convolutions
the write traffic WT. For each hardware platform there is an OI for achieve low performance, performances are typically improved
which workloads are memory-bounded if OI < OI (therefore the unrolling the convolutions into matrix–matrix multiplications [39].
performances are limited to #FLOP∕CC BW OI, where BW is In this case, the number of FLOPs for each layer is estimated as
the bandwidth of the memory) and compute-bounded if OI > OI #FLOP 2UVNCJK, given that there are UVN output elements
(where achieving the MTPCC is actually possible with microarchi- and for each of them CJK multiplications and accumulations are
tecture and software optimizations). Although based on several required. The read traffic is then RT 4NCJK UVCJK and the
assumptions, for instance, that it is possible to overlap memory write traffic is WT 4NUV. In Table 1 we show the size of the unroll
transfers and computations [33]), the roofline model is a successful of the convolution for only the first 15 layers (for sake of brevity) of
tool to benchmark processors in an application-independent way, the network. Some observations can be made:
mainly focusing on the performance of popular kernels (e.g., [27]). 1) OIs are large (in the order of tens of FLOP/B), except for
convolutions with K 1 for which OI can go down to 1.60 FLOP/B.
¶¶
https://goce-ds.eo.esa.int/oads/access/collection/GOCE_Telemetry/.
†††
***Normalizing by frequency is a common procedure to obtain technol- https://github.com/SorourMo/Cloud-Net-A-semantic-segmentation-
ogy-independent metrics that measure the effectiveness of a certain micro- CNN-for-cloud-detection.
‡‡‡
architecture. https://keras.io.
DI MASCIO ET AL. 557
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
Convolution 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
WH 192 192 192 192 96 96 96 48 48 48 24 24 24 12 12
C 4 16 16 32 32 32 64 64 64 128 128 128 256 256 512
KJ 3 3 3 1 3 3 1 3 3 1 3 3 1 3 3
N 16 32 16 32 64 32 64 128 64 128 256 128 256 512 512
UV 192 192 192 192 96 96 96 48 48 48 24 24 24 12 12
RT [MiB] 5.06 20.27 20.26 40.54 10.20 10.16 20.39 5.34 5.20 10.69 3.66 3.09 7.31 5.77 11.53
WT [MiB] 2.25 4.50 2.25 4.50 2.25 1.13 2.25 1.13 0.56 1.13 0.56 0.28 0.56 0.28 0.28
MT [MiB] 7.31 24.77 22.51 45.04 12.45 11.29 22.64 6.47 5.77 11.81 4.22 3.38 7.88 6.05 11.81
#MFLOP 42.5 340 170 75.5 340 170 75.5 340 170 75.5 340 170 75.5 340 679
OI [FLOP/B] 5.54 13.08 7.20 1.60 26.03 14.36 3.18 50.09 28.10 6.10 76.80 48.00 9.14 53.58 54.86
2) Even if OI is large and therefore the workloads can be assumed n21 1 2n2
to be compute-bounded, the absolute amount of memory traffic is OI
8n21 n1 n2
very high (3–45 MiB per layer). These values require a dedicated
design of the memory subsystem compared with processors for non-
compute-intensive workloads, which will be carried out in Sec. V. Assuming that 2n2 ≫ 1, OI ≈ n1 n2 ∕4n1 n2 ,which given a
3) The memory traffic is for a large majority composed by reads certain memory traffic (i.e., n1 n2 const) is maximized for
(92.83% in average). n1 n2 , reaching OI ≈ n1 ∕8. As OI is proportional to the size of
Further performance enhancements can be obtained by mapping the output matrix, SGEMM will eventually achieve the peak perfor-
the matrix–matrix multiplication with optimized libraries. In [39] it is mance for a large enough matrix on a given hardware platform. For
shown that using basic linear algebra subroutines (BLAS) instead of this reason, the SGEMM efficiency
coding the unrolled version from scratch produces a speed-up rang-
ing from 2.43× to 3× depending on the architecture and on the input FLOP∕CC
ESGEMM
size. Using BLAS subroutines, matrix–matrix multiplications are MTPCC
mapped to the SGEMM subroutine,§§§ which (in its nontransposed
form) implements the following algorithm: (i.e., the fraction of time the functional units of the processor are busy
when executing SGEMM) is typically given as a measure of attain-
able performance on a certain hardware platform [40]. When caching
A2←αA0 × A1 βA2 (3) levels are present, increasing the size of the matrix multiplications to
increase OI will eventually cause a drop in performance, as the
operands will not fit anymore in the cache level responsible of peak
performance and reads from lower levels (even main memory) are
where A0, A1, and A2 are matrices of, respectively, size n1 × n2 ,
required during the matrix multiplication, breaking the assumption of
n2 × n3 , and n1 × n3 , and α and β are scalars. Assuming α β 1
the roofline model that memory traffic and computation overlap. This
(as in the case of convolutions) and a square matrix at the output
issue is analyzed in Sec. V.
(n1 n3 ), SGEMM has
§§§
2. Concatenate and Addition
Analogous subroutines are defined for different data types, and the first
letter represents the data type. For instance, SGEMM is for single precision Given that CloudNet is very deep (38 convolutional layers), it
(SP), DGEMM is for double precision (DP), and IGEMM is for integers. In requires specific solutions in its architecture to mitigate the vanishing
this paper, data will be assumed to be SP unless specified otherwise; therefore gradient problem [41]. The designers of CloudNet handled this
SGEMM will be used. problem using skip connections, and addition and concatenation
558 DI MASCIO ET AL.
layers [34]. As can be seen in Fig. 3, although the impact of addition Therefore, OI reaches its maximum (0.5 FLOP/B) for very large
layers on the execution time is negligible (1.1%), concatenation CHW and N. To give an idea of how FC layers compare against
layers take a considerable part of the execution time (24.7%). Fur- convolutional layers, we compared the memory traffic and OI for the
thermore, concatenate operations contain no FLOPs and consist convolutional layers in Table 1 to FC layers with same C, W, H, and
mainly of memory transfers; therefore they cannot be sped up with N. The MT of the FC layers ranges between 1.31× and 18.36×
increased computation capabilities. These considerations suggest compared with the respective convolutional layer, whereas the OI
avoiding architectures concatenation and using skip connection is 3.3× to 154.2× smaller. The high MT associated with FC layers is
between layers with same dimensions (where concatenations are confirmed by [37], although a trend can be noticed: for early CNNs
not needed), as done in [41]. with few convolutional layers (e.g., AlexNet with five convolutional
layers and three FC layers) the percentage of parameters in the FC
3. Batch Normalization layers is very high (for AlexNet 96.07%), whereas state-of-the-art
Batch normalization layers are employed to speed up training and deeper CNNs (typically achieving higher accuracy) like ResNet have
increase accuracy of DNNs [42]. This type of layer also acts as a many convolutional layers (for ResNet the number of convolutional
regularizer, keeping the magnitude of the parameters low and avoid- layers ranges from 53 to 155 and typically only one FC layer is
ing overfitting [42,43]. When using batch normalization during present) and have a much lower percentage of parameters in the FC
inference, each element xi of the activation vector x from the previous layers (ranging, respectively, from 8.04 to 3.42%).
layers is normalized according to Furthermore, the performance for FC layers can be improved
employing batching, i.e., processing more input features in paral-
xi − Exi lel [37]. This technique is particularly effective in the case of
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
B. Other Layers in DNNs for Image Analysis C. DNNs for Telemetry Forecasting
When DNNs are employed for classification, the expected output Recurrent neural networks (RNNs) are typically employed in time
is typically a vector containing the probability of classification for a series analysis like speech recognition and natural language process-
certain object. In these cases, the last layers of the DNN after the ing (NLP) [47–49], and they can be applied, for instance, to early
convolutional layers are composed of fully connected (FC) layers to failure detection or to predict the telemetry of the next orbit given the
make a decision based on the information contained in groups of telemetry of previous orbits, as done in [28]. RNNs are composed by
pixels. This type of network is usually called CNN. FC layers can be a cascade of units with internal feedback, where each unit requires the
seen as convolutional layers where there is no sharing of coefficients, output of the previous one to be ready to calculate the next activation.
i.e., J K W H [37]. This implies that the output is a vector of Typically long short-term memory (LSTM) implementations are
size N, the number of operations is #FLOPs 2NCHW, the chosen to achieve higher accuracy, whereas gated recurrent unit
memory traffic is MT 4CHWN 1 N, and (GRU) implementations provide lower accuracy with higher
1
OI ¶¶¶
Batching is instead not effective with convolutions, as the amount of
21 1∕N 1∕CHW parameters in a convolution is very small (e.g., 3 × 3 × 16).
DI MASCIO ET AL. 559
a) No FMA, D=1, P=1 b) FMA, D=1, P=1 c) FMA, D=4, P=1 d) FMA, D=4, P=4
Fig. 5 Steps to increase the MTPCC of space processors in a power- and area-efficient way.
performance [28]. Furthermore, one or more FC layers are placed power consumption of a general-purpose scalar processor is spent on
before the output [28]. fetching instructions. For instance, the breakdown of energy dissi-
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
LSTM layers are typically memory-bounded [50]. Similarly to pation on a scalar processor executing IGEMM in [5] shows that the
[28], the linear part of the LSTM layer can be described as instruction cache dissipates 19.63% of the total energy, instruction
fetch and decode stages 4.69%, and the virtual memory (comprising
st W ⋅ xt U ⋅ ht−1 b (5) both instruction and data) 7.41%. A percentage of energy dissipation
ranging between around 24 and 32% can therefore be attributed to
where xt , ht−1 , and st are column vectors, respectively, of length m, n, instructions fetching and decoding. Data parallel processors reduce
and n. W and U are, respectively, [m × n] and [n × n]. Therefore the this fraction of power, defining instructions that operate on arrays of
#FLOPs is 2n2 n nm, the MT seen by main memory is D elements instead of scalar elements. Figure 5c shows an example
4n2 nm 3n m, and the OI is with D 4, which (together with FMA operations) achieves
MTPCC 8 FLOP/CC. However, DLP is the least flexible form of
#FLOP parallelism [53], as it can only be applied to calculations that can be
OI (6)
2#FLOP 2n m vectorized (i.e., expressed with instructions on vectors), e.g., matrix–
matrix multiplications in convolutional layers. As a matter of fact, in
with a maximum value of 0.5 FLOP/B for large matrices, i.e., [54], the speed-up found in the convolutional layers of a CNN using
#FLOP ≫ 2n m. This low value can be increased with batching, the data-parallel NEON extension over the baseline ARM ranges
as it turns matrix–vector multiplications into more computational from 2.45× to 2.78×, with a decrease of energy consumption per
intensive matrix–matrix multiplications (as B vectors are put together convolutional layer ranging from 59.11 to 82.04%. The energy
to create a matrix of dimensions [n × B]). In this case, efficiency of the data-parallel solution (i.e., performance in terms
of executed layers per amount of energy) is in this case then 5.98× to
#FLOP B 15.50× the energy efficiency of the non-data-parallel baseline. When
OI (7) the effectiveness of DLP saturates for large D, the solution left to
2#FLOP 3B − 1n m B
increase the MTPCC is to replicate the processing core. In Fig. 5d the
This equation shows that the efficacy of batching in terms core is replicated four times (P 4), achieving an MTPCC of 32
of increase of OI saturates as B grows, until the upper bound FLOP/CC (together with FMA and D 4). Going above four cores
of OImax #FLOP∕6n 2m is achieved. This upper bound is a typically reduces the utilization of the functional units. For instance,
relatively large value, for instance, 27.29 for m 27 and n 60 in [55] it is shown that with eight cores it is possible to obtain for
(typical values in [28]). However, OI cannot be increased arbitrarily CNNs’ performances ranging from 3.99× to 5.76× the performance
by batching in real-time applications, as batching requires that all the of a single core. Similarly, with eight cores it is possible to reach
inputs to the LSTM layers of the batch are ready. For instance, in [50] 5.55× the performance of a single-core implementation of an LSTM
increasing batching from 16 to 64 increases performance to 2.41× the RNN [28].
original value, whereas the time required to complete execution in
more than 99% of the cases increases from 7.2 to 21.3 ms (2.95×). A. Data-Parallel Processors
When compute-intensive applications were to be addressed in the
commercial market, computer architects resorted to packed single
IV. RISC-V Vector Processors
instruction multiple data (SIMD) ISAs with the Intel’s MMX exten-
State-of-the-art processors for space applications typically execute sions (1996) for integers [56] and the SSE extensions (1999) for floats
instructions on two scalar operands [51]. Considering a single core, [57]. The success of ARM in high-end embedded applications made
this type of platform has an MTPCC of 1 FLOP/CC. the SIMD NEON extension, first introduced in the ARMv7-A Cor-
The simplest way of increasing the MTPCC of future space pro- tex-A8 (2005) [58], very popular. Also PULP, one of the most
cessors is to introduce ISA extensions with instructions defining popular sets of RISC-V cores, employs the RI5CY packed SIMD
fused multiply-add (z←wx y) and fused multiply-accumulate extension (2016) defined outside of the RISC-V standard [59].
(z←xy z) operations,**** achieving an MTPCC of 2 FLOP/CC Packed SIMD extensions are typically chosen by hardware design-
(as shown in Fig. 5). This requires modifications to the floating point ers because they can be applied to scalar processors without extensive
unit (FPU) and arithmetic-logic unit (ALU). However, the cost of modifications to the microarchitecture [60]. However, the end of
these changes in the FPUs and ALUs is limited (as, for instance, the Moore’s law is leading computer architects to use more efficient
area of these units is dominated by the multiplier). The biggest cost is ISA extensions, and ARM recently (2017) released its ARMv8-A
instead on the complexity of the register file, which is required to Scalable Vector Extension (SVE) [61]. Although previous Fujitsu’s
provide more operators to the functional units [52]. supercomputers were based on SIMD extensions of SPARC, the
To increase the MTPCC even further, DLP is the most energy- Fujitsu A64FX is the first processor based on the ARMv8-A SVE,
efficient solution available [53]. As a matter of fact, large part of the targeting supercomputer applications. It achieves 2.7 DP-TFLOPS
(7 nm process), a DGEMM efficiency >90% [62] and it is composed
****Both will be indicated with FMA, unless a distinction is to be done. by 48 computing cores, each achieving around 57 DP-GFLOPS [62].
560 DI MASCIO ET AL.
Vector extensions are already known to be more efficient than 2. Microarchitecture of Vector Processors
packed-SIMD, as they can be seen as more flexible versions of There are two main approaches to design a vector processor. Vector
packed-SIMD thanks to their time-multiplexed and vector length- processors for supercomputers, like the Fujitsu AF64X, typically
agnostic approach (the software is oblivious to the hardware vector have a joint scalar and vector pipeline with separated register files and
length of a specific implementation and the same code executes using execution units. The main disadvantage of this approach is that a
the largest parallelism available) [27,60,61]. In SIMD extensions vector load instruction stalls the pipeline also for scalar instructions,
instead, the data width of the operations is encoded in the instruction unless a superscalar pipeline with large ILP is employed (e.g., this is
opcodes. When the architects of such ISAs wish to increase perfor- done in the Fujitsu AF64X with up to four ways). When the ILP is not
mance by widening the vectors, they must add a new set of instruc- high enough, using a decoupled vector pipeline, where the scalar
tions to process these vectors. For instance, Intel’s newest AVX pipeline pushes vector instructions into an instruction queue inter-
instructions are as long as 11 bytes [60]. Furthermore, application facing the vector pipeline, can mitigate this issue. The scalar pipeline
code compiled for previous versions cannot automatically leverage can continue execution and the vector pipeline acknowledges com-
the widened vectors of new implementations. At the same time, code pletion of vector instructions and passes scalar results (when needed)
compiled for wider SIMD registers fails to execute on older machines to the scalar pipeline without passing through the bus. This approach
as the new instructions are not known to older implementations. is employed, for instance, for the Ara processor [27] and it is shown in
Furthermore, in SIMD extra code is needed to handle up to three
Fig. 6. Another advantage of this approach is that it provides a more
fringe elements of stripe mine loops [60].
modular solution and a vector version of a RISC-V processor can be
For these reasons, the proposal for packed-SIMD floating-point
achieved with minimal modifications to the scalar design (i.e., the
was dropped in favor of the Vextension for large floating-point vector
introduction of a front end).
operations [15]. However, there was interest in packed-SIMD fixed-
point operations for use in the integer registers of small RISC-V The critical elements of a vector processor are shown in Fig. 6. The
implementations. A task group is working to define the packed- following subsections will focus on the vector register file (Sec. IV.B),
SIMD P extension [15]. and on the issues limiting scalability of performance (Sec. IV.C).
Furthermore, Sec. IV.D provides insights on the soft error vulner-
1. RISC-V Vector Extension
ability of vector processors.
The RISC-V Vector Extension (RVVE) is similar to the ARMv8-A
B. Vector Register File
SVE and was heavily inspired by the Hwacha†††† development [63].
Both RVVE and ARMv8-A SVE define a configurable vector unit Vector register files (VRFs) are typically more complex than
with 32 vector registers (i.e., given a certain VRF size, the number of register files (RFs), as they have in general more contention given
elements and size of elements can be configured with instructions) FMA operations and masked execution‡‡‡‡ [27]. When considering
[15] and allow the same binary code to work efficiently across Ara, the worst case for contention for access to the VRF is the masked
a variety of hardware implementations, varying in physical vector FMA (multiply-add) instruction, which reads four operands from
storage capacity and data path parallelism. Additionally, ARMv8-A four vector registers (one mask, two factors, and one addend) [27],
SVE includes 16 scalable predicate registers (not defined in the executes the operation only if the mask has a certain value, and writes
baseline RVVE [64]) to optimize loops, using the predicate con- to a register the result of the operation. A straightforward solution to
trolled loops vectorization style [61]. avoid contention in the VRF is therefore to employ a multiported
Although the RVVE is still in the process of being standardized, it static random-access memory (SRAM) with as many ports as needed,
plays such a crucial role in state-of-the-art applications that already in this case four read ports and one write port (4R1W). However,
several developments implementing the RVVE are described in multiported register files come with a large area overhead. In [53], the
literature. The two most notable examples are the Xuantie-910, a area of the VRF for the T0 vector processor according to the different
12 nm RISC-V processor with 16 cores clocked up to 2.5 GHz number of ports employed is analyzed. As the T0 vector processor
with an out-of-order triple-issue 12-stage pipeline [65], and Ara, contains two arithmetic units and one multiplier per lane, to avoid
a RISC-V vector processor based on Ariane achieving up to 33 contention it requires one read port and one write port for the
GFLOP/s and 41 GFLOP/J on 22 nm fully-depleted silicon-on- multiplication, and two ports for read and one for write for each
insulator (FD-SOI) technology. Furthermore, work is being done to arithmetic unit (i.e., 5R3W). Different implementations in ASIC
support the RVVE in popular DNN frameworks like TensorFlow technology are proposed for the VRF, trading-off the number of
Lite [66]. banks and ports: one 5R3W bank of 256 elements (1×5R3W), two
††††
The main difference with RISC-V Vector extension is that Hwacha
‡‡‡‡
fetches its own instructions, as there are two threads: a control thread running RVVE provides for many instructions a field that specifies whether the
on the scalar core and a worker thread [60]. This can potentially lead to higher instruction is to be executed or not according to the value of a bit in a specific
performance, but also higher complexity. vector register [64].
DI MASCIO ET AL. 561
3R2W banks of 128 elements each (2×3R2W), and four 2R1W banks Table 2 Scalability of Ara in terms of number of lanes (peak values
of 64 elements each (4×2R1W). in bold) for 22FDX process (FD-SOI) (data derived from [27])
Data from [53] show that banking decreases the area occupied by Number of lanes
the VRF by 31.7% when going from 1×5R3W to 2×3R2W. How-
Performance metric 2 4 8 16
ever, the efficacy of this technique saturates quickly, as going from
2×3R2W to 4×2R1W decreases the area only by 2.1%. This is due to Max. frequency [normalized] 1.00 1.00 0.94 0.83
Max. FPU utilization [%] 98.20 98.00 97.22 97.36
the increase of overhead to handle the banks (storage cells compose
Area efficiency [DP-kFLOP/s/GE] 2.20 2.85 3.08 3.02
88.9% of the VRF for 1×5R3W, 83.1% for 2×3R2W, and only 41.7% Energy efficiency [DP-GFLOP/mJ] 35.58 37.84 39.91 40.81
for the 4×2R1W implementation).
Banking is also employed in Ara, where the VRF is composed of
eight single-ported read-or-write banks (1RW). To help avoid con-
tention, in Ara vectors are organized in SRAM banks with a shift of logic to handle the increased number of lanes. Therefore, area effi-
one element (“barber pole” shift) [27]. This is particular effective to ciency can be expected to be more critical than energy efficiency in
avoid conflicts when the functional units fetch the first elements of vector processors. Ariane and Ara occupy together between 2228 and
two vectors [27]. However, this organization leaves some residual
10,735 kGE. In particular, the area of Ariane and Ara with four lanes
contention, which is addressed with a round-robin with two priority
is 3434 kGE. i.e., 4.28 times a single-core Ariane comprising level 1
levels [27]. A way to completely solve bank contention is systolic
(L1) caches. Therefore a four-lane vector processor has similar
execution. For instance, Hwacha uses four 1R1W (4×1R1W) dual
requirements in terms of die area compared with state-of-the-art
port banks with stall-free systolic bank execution, capable to sustain
quad-core processor for space [51].
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
C. Scalability
Along with the memory bound identified by the roofline model,
the authors of Ara [27] show that the limited issue rate of instructions
Although existing RISC-V vector processors have good scalability for a single-issue scalar pipeline limits the performance for matrices
in terms of peak performance and efficiency (as can be seen in of sizes smaller than 256 × 256. Therefore, they suggest that the use
Table 2), there are still criticalities to be addressed for small matrices of higher ILP and speculation in the scalar pipeline could improve
and very high requirements of peak performance. The remainder of performance for smaller matrices, where control operations
this subsection discusses how scalability influences frequency, effi- (e.g., configuration of the lanes) have a larger overhead. Similarly
ciency, the effects of the issue rate on the achieved performance and to [27] for an n × n matrix multiplication with SP parameters, an
the width of the interconnect. upper bound due to the issue rate #FLOP 16 OI∕ΔCCissue can be
found, and OI MTPCC ΔCCissue ∕16 due to the issue rate. This
1. Frequency
equation shows that doubling the issue rate (i.e., using a dual-issue
Most considerations in previous sections were based on the fre- microarchitecture) will halve the OI . For instance, as an FMA instruc-
quency-normalized value FLOP/CC, whereas a reduction of clock tion can be issued every five clock cycles (CCs) in Ara, the worst OI is
frequency decreases the peak performance in terms of FLOP/s 5 FLOP/B (8 lanes version with MTPCC 16 FLOP/CC), whereas a
(as #FLOP∕s fCPU #FLOP∕CC) and therefore can decrease dual-issue version lowers this value to 2.5 FLOP/B. As can be seen in
the efficiency of a platform with increased DLP. Sec. V, these values are comparable with upper bounds due to memory
In [27] Ara has been implemented in Global Foundries 22FDX bandwidth and therefore can have an impact on performance when they
process (FD-SOI). As can be seen in Table 2, the two-lane and four- produce an higher OI than memory bandwidth.
lane versions of Ara achieve the same maximum nominal frequency.
In both cases, the critical path is in the DP FMA FPU (1.2 GHz 4. Interconnect
nominal, 0.92 GHz worst case), about 40 gate delays long. Another
critical path (of the same length) is present in the combinational To increase the OI due to the memory bandwidth, Ara uses a single
handshake between the Vector Load and Store Unit (VLSU) and 32 N L -wide bus interface for all the lanes together,§§§§ reaching
operand queues in the lanes of the vector processor. When increasing 512 bits for 16 lanes. To keep the same value of 2 B/DP-FLOP, a
the number of lanes, the second path becomes longer, and therefore 32-lane implementation would need a 1024-bit-wide bus interface.
the frequency is reduced (down to 1.04 GHz for 16 lanes). This is However, this problem can be mitigated using an L1 cache for vector
because the VLSU handles data to and from all the lanes simulta- data (L1V), which allows large bandwidth for data residing in it without
neously. Therefore, a larger number of lanes imply longer combina- requiring a wide crossbar (Fig. 7). The design of an area efficient
tional paths. This shows that, in general, the scalability of the DLP in memory subsystem for RISC-V vector processors is described in Sec. V.
a vector processor is limited by the elements that act on all the
lanes [27]. D. Soft Error Vulnerability
It should be noted that the maximum frequency of the scalar Vector processors typically achieve high utilization of the FPU
processor on the same technology is 1.7 GHz [5]. Therefore, the (e.g., 97% in [27]), whereas scalar processors typically work in
two-lane version already comes with a penalty of at least 30% memory-bounded conditions and therefore achieve much lower
compared with the scalar processor. FPU utilization. This implies an increase of soft error vulnerability
of arithmetic units, as suggested by the models in [68] relating
2. Area and Energy Efficiency utilization and soft error vulnerability. Furthermore, the increase of
The increasing energy efficiency in Table 2 shows good scalability frequency compared with state-of-the-art processors for space
and suggests that the peak in energy efficiency may be obtained for an (e.g., from 250 MHz to 1 GHz) points to an increased percentage
even larger number of lanes. On 22 nm FD-SOI, Ariane and Ara of errors from combinational logic (as shown in [69]), which com-
(depending on the number of lanes) consume between 138 (2 lanes) pose the majority of the area in FPUs and ALUs. For instance, we
and 794 mW (16 lanes) at peak performance [27]. As energy effi- synthesized the BOOM processor¶¶¶¶ on a 65 nm ASIC technology
ciency depends on the ASIC technology employed, changing tech- and the area of the FPU and ALUs (comprising hardware multipli-
nology will provide different efficiency. Resorting to a 65 nm RHBD cation and division) results composed, respectively, for 79.52 and
technology would decrease energy efficiency because of larger 86.11% of combinational logic. Finally, scaling efficiently at least up
power consumption for a given clock frequency. to 16 lanes, a vector processor can achieve high performance when
Area efficiency reaches a maximum for 8 lanes, as for 16 lanes the
increase due to the decreased overhead of the scalar pipeline per §§§§
Hwacha, instead, uses an interface per lane [67].
¶¶¶¶
vector lane is more than compensated by the greater complexity of the https://github.com/riscv-boom/boom-template.git.
562 DI MASCIO ET AL.
Fig. 7 Possible memory hierarchy for a vector processor. Other cores and peripherals (not shown in figure) can be connected to the interconnect.
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
Fig. 8 Theoretical improvement for low OI workloads for matrices residing in L2 and L1V compared with SDR and DDR (single chip).
large ASIC implementations are possible. For this reason, small is added to increase performance especially for workloads with low
technology nodes should be preferred. However, in [70] it is reported OI. The figure also indicates the width W i of the interface between
that going below 28 nm increases the soft error rate (SER) in the levels, which determines the bandwidth Bi of the interface together
terrestrial environment. In FD-SOI technologies this is mainly due to with its clock frequency fclki , according to Bi fclki W i . For
an increase of SER due to protons, whereas the SER due to alpha instance, the Sandy Bridge in [33] has a 384-bit interface and a
particles is slightly decreasing. Given that in space there is a different maximum bandwidth of 384 b/CC. In the case of DRAMs, BD is
radiation environment, the technology node minimizing the SER given by RD CD fclkD W D, where RD is 1 for SDR and 2 for
may be different. DDR, CD is the number of channels for the DRAM, fclk the clock
The separation between scalar and vector pipeline in decoupled frequency, and W the word size. For the DRAM employed in the
vector processors allows for a selective hardening approach. Sandy Bridge in [33] CD 2, fclkD 0.8 GHz, and W D 64, and
Assuming that control operations are executed only in the scalar therefore BD is 25.6 GB/s.
pipeline and computations only in the vector pipeline, redundancy A cache-aware roofline model [33], shown in Fig. 8, highlights the
to avoid catastrophic failures is required only in the scalar pipeline. main benefits of adopting a memory hierarchy similar to Fig. 7. When
In Ara, the critical path limiting the maximum frequency for the data reside in main memory, OI is around 2.50–6.02 FLOP/B
four-lane version is in the vector pipeline and allows for a maximum (depending on the DRAM technology), whereas if data reside in
frequency of around 1 GHz, whereas the scalar pipeline has a an L2 (with W X 64 b) OI becomes 0.25 FLOP/B and a dedi-
critical path allowing up to 1.7 GHz [5]. Therefore, applying cated L1V with W C;V 356 b reduces OI to 0.04 FLOP/B.
state-of-the-art techniques to improve fault tolerance only to the
Furthermore, from Fig. 8 it can be deduced that keeping a processor
scalar pipeline, such as triple modular redundancy (TMR) at flip-
in a compute-bounded state for a given OI puts increasingly higher
flop level in the scalar pipeline and error detection and correction
requirements on the memory bandwidth when MTPCC (hence the
(EDAC) codes in the scalar register file, will not cause any penalties
computational capabilities) is increased (e.g., an implementation
in terms of maximum frequency and hence in terms of MTPCC .
with lower MTPCC has a lower OI ). As a result, extremely high-
As a matter of fact, TMR and EDAC are reported to cause only
performance processors for DNNs are actually memory-bounded
9% decrease in frequency in the LEON2 [71]. A similar decrease
except for very high OI [50].
would keep the maximum frequency of Ariane from 1.7 GHz [5] to
around 1.5 GHz, which is still above the maximum frequency
possible in the vector pipeline. A. Main Memory
The need for (at least) radiation-tolerant parts with solid flight
heritage limits the use of state-of-the-art memories. As a result, main
V. Memory Hierarchy memories for space in ESA missions lag behind commercial counter-
Figure 7 shows a possible memory hierarchy for a vector proces- parts in terms of performance. For instance, state-of-the-art OBCs
sor. As a typical memory hierarchy for scalar processors, it comprises typically employ single data rate (SDR) DRAM [72]. The SDR
an L1 cache for scalar data (L1D), a L1 instruction cache (L1I), a DRAM tested in [25] (ISSI IS42S86400B-7TL) has 16 bits for data
unified level 2 cache (L2),***** and a main memory. However, an L1V I/O and achieves up to 166 MHz. Therefore, its BD is 2.66 Gbps, i.e.,
*****This is typically the case of multicore processors (not shown in the two orders of magnitude less compared with the DDR3 DRAM
figure), where more cores with their own L1 caches are connected to the L2 via memories used in [33]. Faster DRAMs are also being considered,
an interconnect. as the DDR2 tested in [25] (IS43DR81280B-25DBLI), which has
DI MASCIO ET AL. 563
8 bits for I/O data and achieves up to 400 MHz. This means a BD of fCPU ∕fMC 2, we estimate 200 CCs of additional latency seen
6.4 Gbps, which is still more than one order of magnitude lower by the processor during reads due to the use of RS. This is a
compared with the DDR DRAM in [33]. significant increase (e.g., read latency of the DRAM chip around
20 ns [26], i.e., 15–20 CCs for fCPU 1 GHz), and therefore it may
1. EDAC Codes be required to lower the level of information redundancy or not
In the space environment, DRAMs suffer from single event upsets applying EDAC altogether on vector data to achieve the required
(SEUs) and multiple bit upsets (MBUs) as SRAMs [73]. However, in level of performance.
DRAMs most of the upsets happen in weakened cells [74]. Further-
more, compared with SRAMs, DRAMs are also more likely to suffer 2. Vulnerability of DNN Parameters
from stuck bits (cells stuck to a value, mostly related to variable bit To evaluate the effect of not applying EDAC on the DRAM when
retention [75]) and single event functional interrupts (SEFIs). The running a DNN, we estimate the effect of upsets on the parameters
effect of SEFIs in a DRAM ranges from some tens of bits to a full chip residing in the DRAM for CloudNet.
wrong per read cycle and can be recovered only with a chip reset or According to [74], a 512 Mb SDR DRAM memory
sometimes with a full power cycle [74]. To detect and correct these (MMSD08512408S-Y) experiences 2.75e–11 upset/bit/day in
errors, EDAC codes are employed in the DRAM. Including EDAC LEO. Therefore, 0.19 upsets/day are to be expected for coefficients
checkbits in DRAMs decreases the bandwidth, as also checkbits are and feature maps residing in the DRAM (using the peak memory
read and written, and increases latency, as the checkbits have to be reported in Sec. III). To assess the sensitivity to SEUs, we ran a fault
calculated before storing the data in memory and checked before injection campaign on the DNN coefficients expressed in SP floating
using the data read from the memory. For DRAMs in space embedded
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
3. Proposed Solutions for DRAMs workloads and finds that caches can significantly improve the perfor-
While the effect of SEUs on parameters can be tolerated by the mance of a vector processor. Furthermore, in [84] it is shown that
intrinsic robustness of DNNs, SEFIs produce an unpredictable number the use of caches helps masking memory latency, as increasing by
of errors per CC and therefore require mitigation. According to data 3.21× the latency of a memory access (from 14 CCs to 45 CCs)
from [74], a 512 Mb SDR DRAM memory (MMSD08512408S-Y) roughly triplicates the mean delay per memory reference for a proc-
experiences 1.33e−3 SEFI/device/day. To achieve the peak memory essor with uncached vector data and less than doubles the access time
required, 14 chips are required and therefore not including any EDAC for a processor with an L1 cache for vector data.
will produce a failure due to SEFIs every 53.7 days. This is unaccept- The following subsections will carry out a design exploration of
able, as every inference after the SEFI is likely to have insufficient QoS the L1V to assess which sizes, organizations, and write policies are
until the next reset of the failing chip. As a mitigation, DRAM chips more efficient for vector processors.
can be reset periodically. Assuming a reset every 2 h, the percentage of
failed inferences due to SEFIs WI SEFI in the worst case is 1. Size
From Table 1, it is clear that the large matrices originating from
FailuresSEFI ΔT rst unrolling of convolutional layers (ranging from 3 to 41 MiB) do not
WISEFI 0.16% (10)
Total inferences MTTFSEFI fit even in large L2 caches (e.g., 2 MiB [51]). This problem can be
addressed with tiling, as shown in Fig. 9. In this approach, two levels
The contribution to wrong inferences of SEUs can be estimated with of looping (shown in Fig. 9 with index i and j) select a subset of the
a similar equation, where the MTTFSEU in the denominator is divided matrix–matrix multiplication that produces one of the
by 0.03 to account for the discussion in Sec. V.A.2 on the vulnerable
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
bits of floating point coefficients and T rst is replaced with the time UV N
required for a single inference T inf . The value found is negligible b b
(two orders of magnitude less than the contribution of SEFIs). How-
ever, the final value of average reliability Ravg 1 − WI SEU − WISEFI tiles of the result, each composed of b × b elements. By increasing
(99.84%) can be not deemed enough for critical applications. The the size of the cache, it is possible to work on larger matrix blocks
availability instead depends also on the maintenance time after a reset. residing in the L1V. The subset of operations obtained in Fig 9b can
If we assume a maintenance time of 30 s for each reset, we find that the be decomposed into
availability of the service is 99.58%, whereas a maintenance time of
300 s produces an availability of 95.83%. Both values are below typical CJK
requirements of dependable systems (e.g., [82]). b
A tradeoff between RS and no EDAC is represented by simpler
EDAC codes. EDAC codes with lower redundancy, although they segments, and the results of these segments can be accumulated to
cannot mask SEFIs, can still detect some of the wrong bits caused by generate the final result of the tile. The level (c) in Fig. 9 is where the
the SEFI. For instance, a parity bit per chip can detect an odd number mapping to SGEMM (described in Sec. III.A.1) can be applied.
of errors in a chip, and it is possible to keep track of them with a One of the possible implementations of SGEMM (Fig. 9d) is a loop
counter. When the number of errors from a chip exceeds a certain selecting the mth column of A0 and the mth row of A1 and generating
threshold in a certain time window, the DRAM chip is reset to recover a matrix where the pth column is the mth column of A0 multiplied by
from a probable SEFI. Assuming a threshold of three errors and an A1mp . Vectorization is applied with a maximum vector length of V L ,
equal probability that the SEFI will cause an even or odd number of with FMA (accumulate) operations between the vector A0m and a
errors, the percentage of wrong inferences due to SEFIs is scalar A1mp. A matrix representation of this implementation for a
2N thr 1 2 × 2 example is shown below.†††††
WISEFI 0.0009% (11) 0 1
MTTFSEFI ∕ΔT inf
A011 A111 A012 A121 A011 A112 A012 A121
Regarding SEUs, neglecting accumulation and MBUs, all the A2 @ A
upsets are detected. Therefore Rav 99.9991%, which is a substan- A021 A111 A022 A121 A021 A112 A012 A122
tial increment compared with employing no EDAC. There is a 0 1 0 1
substantial increment in availability too, with 99.9994 and j j j j
B C B C
99.994%, respectively, for 30 and 300 s of unavailability per reset. B C B C
B A01 A111 A01 A112 C B A02 A121 A02 A122 C
Table 3 summarizes the different EDAC and reset approaches @ A @ A
discussed to protect DRAMs for DNNs. j j j j
a)
b)
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
c)
d)
Fig. 9 Example of tiling of a matrix–matrix multiplication. “Acc.” stands for accumulation.
row vector of length b from main memory to L1V is UV CJK N
T L;CJK×UV T L;CJK×b T b×b 0
b b b
b SE b
T L;b T L;V
VL BM with b 0 UVmodb.
Similar equations can be derived for storing the result, substituting
and the time required to read an entire b × b tile is T L;b×b T L;b b. the subscript L with S. Only the final result for each tile is written to
The time required to read a fringe b × b 0 tile with b 0 < b is instead main memory; therefore the time to store all the results is
0
b SE b 0 UV N
T L;b×b 0 T L;V b T S;N×UV T S;N×b T L;b×b 0
VL BM b b
There are three possible implementations, depending on which tile with b 0 UVmodb.
(of the coefficient, input feature, and output feature matrix) is kept Considering the associate continuous functions (without modulo,
into the L1V during the innermost looping. Assuming that the output ceiling, and floor functions), it is possible to prove that the fastest
feature matrix is kept in L1V, the time required during the loop on implementation is the one keeping in L1V the tile of the output
CJK × b to load all the tiles in a CJK × b stripe of the CJK × UV feature matrix. This is because this implementation does not require
input feature matrix (as shown in Fig. 9b) is loading and storing of the temporary tile of the output matrix during
accumulation.
CJK To trade off the speed-up against the increase in size due to a larger
T L;CJK×b T L;b×b L1V, we consider the area efficiency in terms of FLOP/CC/GE for
b
matrix multiplications with matrices residing in L1V. To give a
whereas for a b × CJK stripe of the N × CJK matrix realistic estimation of the cache size that maximizes the area effi-
ciency, we consider what the effect of adding an L1V to Ara would
be in terms of area. The area of Ariane and Ara ranges from 2228
CJK
T L;b×CJK T L;b×b T b×b 0 for two lanes to 10,735 kGE for 16 lanes. As a worst case for
b memory-bounded conditions, we assume 16 lanes (V L 16), and
in this case the area without L1V is 10,735 GE. The area of the L1
where b 0 CJKmodb. As every column has to be multiplied for cache is estimated as AL1V;GE 6∕4N b , assuming 6T SRAM cells
every row, the total time spent reading the coefficient matrix is and a GE corresponding to four transistors.
We will consider four cases comprising all the combinations of
UV N memory with latency 50 CCs (representative of the latency without
T L;N×CJK T L;b×CJK
b b RS) and 300 CCs (representative of the latency with RS) and with
bandwidths of 4 and 40 b/CC (respectively, representative of a
where the ceiling is required because all the matrix of the coefficient memory module with 4 SDR chips and 4 DDR chips). Table 4 shows
is to be read again even if only one column of the input feature is left to the results of this model. The main observations are that the optimal
be loaded. Similarly, the total time spent reading the CJK × UV size of L1V is much larger (256 KiB-1 MiB) than a typical L1D
matrix is instead (e.g., 16 KiB [51]) and that the most impacting factor on the area
566 DI MASCIO ET AL.
Table 4  Estimates of area Atot [MGE] and area efficiency AE [FLOP/CC/MGE] for a 16-lane vector processor with different sizes of L1V, main memory (latency and bandwidth), and maximum size of the tile b × b when applying tiling to the layers of CloudNet

Characteristic   64 KiB    128 KiB   256 KiB   512 KiB   1 MiB     2 MiB
b                40        60        84        120       168       240
Atot             11.5      12.3      13.9      17.0      23.3      35.9
Layer 1: C = 4, N = 16, J = K = 3, U = V = 192
AE_{50,40}       1.06E+0   1.44E+0   1.91E+0   2.35E+0   2.44E+0   2.34E+0
AE_{50,4}        4.09E−1   5.44E−1   7.23E−1   8.59E−1   8.82E−1   8.28E−1
AE_{300,4}       1.57E−1   2.14E−1   2.84E−1   3.48E−1   3.59E−1   3.44E−1
AE_{300,40}      2.06E−1   2.83E−1   3.76E−1   4.68E−1   4.86E−1   4.71E−1
Layer 11: C = 128, N = 256, J = K = 3, U = V = 24
AE_{50,40}       4.04E−1   1.58E+0   2.03E+0   2.40E+0   2.33E+0   2.08E+0
AE_{50,4}        2.62E−1   5.97E−1   7.65E−1   8.72E−1   8.43E−1   7.33E−1
AE_{300,4}       6.48E−2   2.35E−1   3.01E−1   3.54E−1   3.43E−1   3.05E−1
AE_{300,40}      7.09E−2   3.11E−1   3.99E−1   4.78E−1   4.63E−1   4.17E−1
Layer 19: C = 512, N = 1024, J = K = 3, U = V = 6
The main observations are that the optimal size of L1V is much larger (256 KiB–1 MiB) than a typical L1D (e.g., 16 KiB [51]) and that the factor with the largest impact on the area efficiency is the dimensions of the convolution. For each layer, one cache size maximizes the area efficiency independently of latency and bandwidth. This value decreases from 1 MiB to 256 KiB when going from layers with large UV and small C and N to layers with small UV and large C and N. This means that processors intended to run deeper CNNs can employ smaller caches with a lower penalty. However, the maximum area efficiency decreases going from layer 1 to layer 11 to layer 19.
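The Atot row of Table 4 can be reproduced from the area model quoted above (a 16-lane baseline of 10,735 kGE plus 6 ∕ 4 GE per SRAM bit); the short sketch below, with assumed helper names, shows the arithmetic.

# Sketch of the area estimate behind the Atot row of Table 4: a 16-lane
# Ariane + Ara baseline of 10,735 kGE plus an L1V modeled as 6T SRAM cells
# (6 transistors per bit) with one gate equivalent (GE) = 4 transistors.
BASE_KGE = 10735  # 16-lane baseline area without L1V, from the text

def l1v_area_kge(capacity_kib):
    bits = capacity_kib * 1024 * 8
    return (6 / 4) * bits / 1000  # kGE

for cap in (64, 128, 256, 512, 1024, 2048):
    atot_mge = (BASE_KGE + l1v_area_kge(cap)) / 1000
    print(f"{cap:5d} KiB -> Atot ~ {atot_mge:.1f} MGE")
# Prints ~11.5, 12.3, 13.9, 17.0, 23.3, and 35.9 MGE, matching the Atot row.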
2. Organization
The model in the previous section assumes that it is possible to keep the tiles in the L1V, i.e., that loading a vector belonging to one of the tiles does not cause the eviction of data belonging to one of the other tiles still required. Whether this happens or not depends on the cache organization, and an ineffective organization requires larger caches to allow the tiles to reside in the cache during the computation.
Data-parallel ISA extensions (including the RVVE [64]) typically support vector load and store operations with nonunit stride V_S; i.e., two contiguous elements of the vector are placed in noncontiguous locations separated by V_S − 1 elements. According to the model in [84], the fraction of nonunit strides in a workload determines whether organizations similar to those of scalar processors are enough to achieve acceptable performance or organizations specific to vector processors are required. One example of the latter is prime-mapped caches [84], which have a conflict-free memory organization for vectors with power-of-two strides. However, they have no advantage over direct-mapped caches (the simplest cache organization for scalar processors) when all the strides are unitary. In [53] the breakdown of vector memory accesses for 20 benchmarks running on three different vector machines (Cray90, Alliant FX/8, Convex C3) is reported. The respective percentages are 66.37% unit stride, 24.24% other strides, and 9.40% indexed (also known as "scatter and gather" and also supported by the RVVE [64]). The improvement with prime-mapped caches for a typical workload with 70% unit stride is 2× over the cacheless version, whereas the improvement for direct-mapped caches is below 1.5× [84].
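A minimal sketch of the underlying indexing effect (with hypothetical cache parameters) is shown below: with a power-of-two number of sets, power-of-two strides map to only a few sets and conflict, whereas a prime number of sets spreads the same accesses over almost all sets.

# Why power-of-two strides alias in a direct-mapped cache but spread out when
# the number of sets is prime (the idea behind prime-mapped caches [84]).
LINE_BYTES = 32

def set_index(addr, n_sets):
    return (addr // LINE_BYTES) % n_sets

def touched_sets(stride_elems, n_sets, elem_bytes=8, n_accesses=1024):
    addrs = [i * stride_elems * elem_bytes for i in range(n_accesses)]
    return len({set_index(a, n_sets) for a in addrs})

# 256 sets (power of two) vs 251 sets (prime), stride of 64 elements:
print(touched_sets(64, 256))  # only 16 distinct sets -> conflict misses
print(touched_sets(64, 251))  # all 251 sets used -> conflicts avoided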
Typical applications that require nonunit strides are the fast Fourier transform (FFT) and its inverse (IFT) [84]. FFT is employed in several compute-intensive workloads. For instance, in [86] it is proposed to speed up CNN execution, as convolutions can be substituted by a sequence of FFT, elementwise multiplication, and IFT.
To investigate whether vector loads and stores with nonunit strides are present in DNNs, we translated CloudNet into ARM NEON assembly (which supports vector load and store strides of size 1, 2, 3, 4, and 8) using TVM.§§§§§ The fractions of vector accesses with stride 1, stride 2, and stride 4 are, respectively, 97.13, 1.62, and 1.25%. No accesses with stride 3 (supported in NEON) have been found. Translating other DNNs leads instead to only unit stride accesses. For instance, translating the popular resnet18_v1 [41] model did not produce nonunit stride accesses.
§§§§§ https://github.com/apache/incubator-tvm.
These findings suggest that, although in a first phase this problem could be mitigated by relying on certain choices of DNN architectures and software implementations to reduce the fraction of nonunit vector strides, in general different cache organizations are needed compared with those typically employed for scalar processors.

3. Write Policy
A microarchitecture with separate scalar and vector data caches requires a solution to handle memory coherence issues when data in one of the two is modified and an old value is read from the other. This can be addressed with a write-through policy for L1V and L1D, although this comes with substantial penalties, especially in terms of power [87], memory traffic [88], and performance [89].
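The hazard can be illustrated with a toy model (the dictionaries and function names below are illustrative, not the proposed microarchitecture): a scalar store that stays in a write-back L1D is invisible to a later vector load that hits in the L1V, whereas a write-through store at least keeps main memory current.

main_mem = {0x100: 1.0}
l1d, l1v = {}, {}

def vector_load(addr):
    # Vector side: refill L1V from main memory on a miss, then hit in L1V.
    if addr not in l1v:
        l1v[addr] = main_mem[addr]
    return l1v[addr]

def scalar_store(addr, value, write_through=False):
    # Scalar side: write-back keeps the new value only in L1D;
    # write-through also updates main memory on every store.
    l1d[addr] = value
    if write_through:
        main_mem[addr] = value

print(vector_load(0x100))      # 1.0 -> the line is now cached in L1V
scalar_store(0x100, 2.0)       # write-back: main memory still holds 1.0
print(vector_load(0x100))      # still 1.0 -> stale data read from L1V
scalar_store(0x100, 3.0, write_through=True)
print(main_mem[0x100])         # 3.0 -> memory is current, but the stale L1V
                               # copy must still be invalidated or refilled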
VI. Conclusions
The recent shift of focus of the space industry from large GEO to small LEO satellites opens up new challenges. Limited downlink data rates and short communication windows typically allow the transmission of just a fraction of the data generated by on-board sensors in small LEO satellites. The efficiency of the downlink can be increased with data compression and with data removal (e.g., removing images that have a certain percentage of pixels covered by clouds). This solution requires a dedicated processor that comes at a relatively high cost in terms of power (around 5 W), which can be sustained only by relatively large satellites. Furthermore, long periods without contact with the base station require an on-board virtual operator, monitoring the status of the satellite and making decisions when communication with the ground station is not possible.
These challenges in terms of downlink efficiency and dependability can be addressed with DNNs when it is possible to build relatively large datasets (e.g., thousands of images or months of telemetry). Therefore, there is a need for large, public, and standardized datasets to be used as challenges for DNN architectures to
be deployed in space applications. However, part of the future LEO satellites is planned to be launched in large constellations, making large datasets more easily available in the future.
The analysis of the workloads associated with DNNs shows that most parts are very compute-intensive and can be mapped to matrix–matrix multiplications, for which DLP is the most efficient microarchitectural solution to increase execution speed. Among the data-parallel ISA extensions available, the RVVE is gaining momentum because of its openness and efficiency. Although there are already processors based on the RVVE, the software ecosystem of the RVVE is in an early stage, as the ISA specifications are not frozen yet. Therefore, during the early development of a RISC-V vector processor, some adjustments may be required. This is a risk that can be accepted given the long development times of space processors.
The analysis of the microarchitecture of a vector processor shows possible criticalities both for the computational capabilities and for the memory hierarchy. For instance, the scalability with the number of lanes can be an issue, especially for operations involving all of them. The width of the bus interface has also been found to be a possible bottleneck, and the use of an L1V has been suggested as a possible mitigation approach. L1 caches for vector data maximize the area efficiency when executing convolutional layers if their size is around 256 KiB–1 MiB. Furthermore, the microarchitecture of the scalar pipeline affects the performance for small OI, given the limited issue rate of microarchitectures with low ILP. Moreover, it is possible to apply different redundancy approaches to the decoupled vector and scalar pipelines to reduce the resulting performance penalties.
The relatively large size and the focus on high performance of vector processors require the identification of a radiation-tolerant ASIC technology with a technology node around 28 nm (considering also the SER), whereas state-of-the-art processors in space systems are typically still based on RHBD 65 nm technologies. Furthermore, an ASIC technology with multiported SRAMs is required for an area-efficient implementation of the VRF.
Finally, this work investigated the performance and dependability characteristics of the main memory, which involves one of the most important tradeoffs in space embedded systems. Demanding applications (e.g., image classification) require a main memory with around 1 GiB capacity, which is more than the typical DRAM capacity required in many space missions. When availability is not a primary concern, EDAC codes for DRAMs with low redundancy and latency can be employed to detect SEFIs and restart DRAM chips in noncritical applications. In even less critical applications, periodic resets of DRAM chips can be deemed sufficient. For critical applications, RS is still required. Therefore, some performance-demanding applications requiring high availability (e.g., online processing) may be unfeasible.

Acknowledgments
This work was supported by the European Space Agency under the NPI Program, Cobham Gaisler AB, and Delft University of Technology.

References
[1] Lemley, J., Bazrafkan, S., and Corcoran, P., "Deep Learning for Consumer Devices and Services: Pushing the Limits for Machine Learning, Artificial Intelligence, and Computer Vision," IEEE Consumer Electronics Magazine, Vol. 6, No. 2, 2017, pp. 48–56. https://doi.org/10.1109/MCE.2016.2640698
[2] Schwank, J. R., Shaneyfelt, M. R., and Dodd, P. E., "Radiation Hardness Assurance Testing of Microelectronic Devices and Integrated Circuits: Radiation Environments, Physical Mechanisms, and Foundations for Hardness Assurance," IEEE Transactions on Nuclear Science, Vol. 60, No. 3, 2013, pp. 2074–2100. https://doi.org/10.1109/TNS.2013.2254722
[3] Wyrwas, E., "Proton Testing of AMD e9173 GPU," 2019, https://nepp.nasa.gov/files/30362/NEPP-TR-2019-Wyrwas-TR-19-022_AMD-e9173-GPU-2019 June02-TN72682.pdf.
[4] Di Mascio, S., Menicucci, A., Gill, E., Furano, G., and Monteleone, C., "Leveraging the Openness and Modularity of RISC-V in Space," Journal of Aerospace Information Systems, Vol. 16, No. 11, 2019, pp. 454–472. https://doi.org/10.2514/1.I010735
[5] Zaruba, F., and Benini, L., "The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-Ready 1.7-GHz 64-Bit RISC-V Core in 22-nm FDSOI Technology," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 27, No. 11, 2019, pp. 2629–2640. https://doi.org/10.1109/TVLSI.2019.2926114
[6] Li, X., Adve, S. V., Bose, P., and Rivers, J. A., "Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions," 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07), IEEE Publ., Piscataway, NJ, 2007, pp. 266–275. https://doi.org/10.1109/DSN.2007.15
[7] Blacker, P., Bridges, C. P., and Hadfield, S., "Rapid Prototyping of Deep Learning Models on Radiation Hardened CPUs," 2019 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), IEEE Publ., Piscataway, NJ, 2019, pp. 25–32. https://doi.org/10.1109/AHS.2019.000-4
[8] Lai, L., and Suda, N., "Enabling Deep Learning at the IoT Edge," 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), IEEE Publ., Piscataway, NJ, 2018, pp. 1–6. https://doi.org/10.1145/3240765.3243473
[9] Furano, G., Meoni, G., Dunne, A., Moloney, D., Ferlet-Cavrois, V., Tavoularis, A., Byrne, J., Buckley, L., Psarakis, M., Voss, K.-O., and Fanucci, L., "Towards the Use of Artificial Intelligence on the Edge in Space Systems: Challenges and Opportunities," IEEE Aerospace and Electronic Systems Magazine, Vol. 35, No. 12, 2020, pp. 44–56. https://doi.org/10.1109/MAES.2020.3008468
[10] Lentaris, G., Maragos, K., Stratakos, I., Papadopoulos, L., Papanikolaou, O., Soudris, D., Lourakis, M., Zabulis, X., Gonzalez-Arjona, D., and Furano, G., "High-Performance Embedded Computing in Space: Evaluation of Platforms for Vision-Based Navigation," Journal of Aerospace Information Systems, Vol. 15, No. 4, 2018, pp. 178–192. https://doi.org/10.2514/1.I010555
[11] Pignol, M., "COTS-Based Applications in Space Avionics," 2010 Design, Automation Test in Europe Conference Exhibition (DATE 2010), IEEE Publ., Piscataway, NJ, 2010, pp. 1213–1219. https://doi.org/10.1109/DATE.2010.5456992
[12] Del Sozzo, E., Solazzo, A., Miele, A., and Santambrogio, M. D., "On the Automation of High Level Synthesis of Convolutional Neural Networks," 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE Publ., Piscataway, NJ, 2016, pp. 217–224. https://doi.org/10.1109/IPDPSW.2016.153
[13] Xi, S. L., Yao, Y., Bhardwaj, K., Whatmough, P., Wei, G.-Y., and Brooks, D., "SMAUG: End-to-End Full-Stack Simulation Infrastructure for Deep Learning Workloads," ACM Transactions on Architecture and Code Optimization, Vol. 17, No. 4, 2020, pp. 1–26. https://doi.org/10.1145/3424669
[14] Andersson, J., "Development of a NOEL-V RISC-V SoC Targeting Space Applications," 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE Computer Soc., Los Alamitos, CA, 2020, pp. 66–67. https://doi.org/10.1109/DSN-W50199.2020.00020
[15] "The RISC-V Instruction Set Manual Volume I: Unprivileged ISA, Document Version 20190608-Base-Ratified," RISC-V Foundation, 2019, https://content.riscv.org/wp-content/uploads/2019/06/riscv-spec.pdf.
[16] Henry, C., "Geostationary Satellite Orders Bouncing Back," 2020, https://spacenews.com/geostationary-satellite-orders-bouncing-back/.
[17] Lal, B., Sylak-Glassman, E., Mineiro, M., Gupta, N., Pratt, L., and Azari, A., "Global Trends in Space Volume 2: Trends by Subsector and Factors that Could Disrupt Them," Vol. 2, Inst. for Defense Analyses, Science & Technology Policy Inst., IDA Paper P-5242, 2015, https://www.ida.org/-/media/feature/publications/g/gl/global-trends-in-space-volume-2-trends-by-subsector-and-factors-that-could-disrupt-them/p5242v2.ashx.
[18] Maral, G., Bousquet, M., and Sun, Z., Satellite Communications Systems: Systems, Techniques and Technology, Wiley, Hoboken, NJ, 2020, Chap. 1.
[19] Radtke, J., Kebschull, C., and Stoll, E., "Interactions of the Space Debris Environment with Mega Constellations—Using the Example of the OneWeb Constellation," Acta Astronautica, Vol. 131, Feb. 2017, pp. 55–68. https://doi.org/10.1016/j.actaastro.2016.11.021
[20] Selva, D., and Krejci, D., "A Survey and Assessment of the Capabilities of Cubesats for Earth Observation," Acta Astronautica, Vol. 74, May
DDR2 and DDR3 Memories," 2016 IEEE Radiation Effects Data Workshop (REDW), IEEE Publ., Piscataway, NJ, 2016, pp. 1–7. https://doi.org/10.1109/NSREC.2016.7891742
[26] "IS43/46DR81280B(L), IS43/46DR16640B(L) Datasheet," Integrated Silicon Solution, Inc. (ISSI), 2015, http://www.issi.com/WW/pdf/43-46DR81280B-16640B.pdf.
[27] Cavalcante, M., Schuiki, F., Zaruba, F., Schaffner, M., and Benini, L., "Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multiprecision Floating-Point Support in 22-nm FD-SOI," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 28, No. 2, 2020, pp. 530–543. https://doi.org/10.1109/TVLSI.2019.2950087
[28] Cappellone, D., Di Mascio, S., Furano, G., and Ottavi, A. M. M., "On-Board Satellite Telemetry Forecasting with RNN on RISC-V Based Multicore Processor," 2020 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), IEEE Publ., Piscataway, NJ, 2020, pp. 1–6. https://doi.org/10.1109/DFT50435.2020.9250796
[29] Sze, V., Chen, Y.-H., Yang, T.-J., and Emer, J. S., "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, Vol. 105, No. 12, 2017, pp. 2295–2329. https://doi.org/10.1109/JPROC.2017.2761740
[30] Luo, C., Li, X., Wang, L., He, J., Li, D., and Zhou, J., "How Does the Data Set Affect CNN-based Image Classification Performance?" 2018 5th International Conference on Systems and Informatics (ICSAI), IEEE Publ., Piscataway, NJ, 2018, pp. 361–366. https://doi.org/10.1109/ICSAI.2018.8599448
[31] Phiri, D., and Morgenroth, J., "Developments in Landsat Land Cover Classification Methods: A Review," Remote Sensing, Vol. 9, No. 9, 2017, p. 967. https://doi.org/10.3390/rs9090967
[32] Williams, S., Waterman, A., and Patterson, D., "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Communications of the ACM, Vol. 52, No. 4, 2009, pp. 65–76. https://doi.org/10.1145/1498765.1498785
[33] Ilic, A., Pratas, F., and Sousa, L., "Cache-Aware Roofline Model: Upgrading the Loft," IEEE Computer Architecture Letters, Vol. 13, No. 1, 2014, pp. 21–24. https://doi.org/10.1109/L-CA.2013.6
[34] Mohajerani, S., and Saeedi, P., "Cloud-Net: An End-to-End Cloud Detection Algorithm for Landsat 8 Imagery," IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, IEEE Publ., Piscataway, NJ, 2019, pp. 1029–1032. https://doi.org/10.1109/IGARSS.2019.8898776
[35] Shelhamer, E., Long, J., and Darrell, T., "Fully Convolutional Networks for Semantic Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 39, No. 4, 2017, pp. 640–651. https://doi.org/10.1109/TPAMI.2016.2572683
[36] Bianco, S., Cadene, R., Celona, L., and Napoletano, P., "Benchmark Analysis of Representative Deep Neural Network Architectures," IEEE Access, Vol. 6, Oct. 2018, pp. 64,270–64,277. https://doi.org/10.1109/ACCESS.2018.2877890
[37] Abdelouahab, K., Pelcat, M., Sérot, J., and Berry, F., "Accelerating CNN inference on FPGAs: A Survey," 2018, http://arxiv.org/abs/1806.01683.
[38] Dumoulin, V., and Visin, F., "A Guide to Convolution Arithmetic for Deep Learning," arXiv preprint arXiv:1603.07285, 2016.
[39] Chellapilla, K., Puri, S., and Simard, P., "High Performance Convolutional Neural Networks for Document Processing," Tenth International
lation (ISMS), IEEE Publ., Piscataway, NJ, 2016, pp. 174–179. https://doi.org/10.1109/ISMS.2016.14
[44] Lai, L., Suda, N., and Chandra, V., "Cmsis-nn: Efficient Neural Network Kernels for Arm Cortex-m cpus," arXiv preprint arXiv:1801.06601, 2018.
[45] Lee, C.-Y., Gallagher, P., and Tu, Z., "Generalizing Pooling Functions in CNNs: Mixed, Gated, and Tree," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 40, No. 4, 2018, pp. 863–875. https://doi.org/10.1109/TPAMI.2017.2703082
[46] Cong, J., and Xiao, B., "Minimizing Computation in Convolutional Neural Networks," Artificial Neural Networks and Machine Learning—ICANN 2014, edited by S. Wermter, C. Weber, W. Duch, T. Honkela, P. Koprinkova-Hristova, S. Magg, G. Palm, and A. E. P. Villa, Springer International Publishing, Cham, Switzerland, 2014, pp. 281–290. https://doi.org/10.1007/978-3-319-11179-7_36
[47] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y., "Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation," Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Assoc. for Computational Linguistics, Stroudsburg, PA, 2014, pp. 1724–1734.
[48] Graves, A., "Supervised Sequence Labelling with Recurrent Neural Networks," Ph.D. Dissertation, Technical Univ. of Munich, Munich, 2008.
[49] Graves, A., Mohamed, A.-R., and Hinton, G., "Speech Recognition with Deep Recurrent Neural Networks," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Inst. of Electrical and Electronics Engineers, New York, 2013, pp. 6645–6649.
[50] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P.-L., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T. V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C. R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey, A., Jaworski, A., Kaplan, A., Khaitan, H., Killebrew, D., Koch, A., Kumar, N., Lacy, S., Laudon, J., Law, J., Le, D., Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G., Maggiore, A., Mahony, M., Miller, K., Nagarajan, R., Narayanaswami, R., Ni, R., Nix, K., Norrie, T., Omernick, M., Penukonda, N., Phelps, A., Ross, J., Ross, M., Salek, A., Samadiani, E., Severn, C., Sizikov, G., Snelham, M., Souter, J., Steinberg, D., Swing, A., Tan, M., Thorson, G., Tian, B., Toma, H., Tuttle, E., Vasudevan, V., Walter, R., Wang, W., Wilcox, E., and Yoon, D. H., "In-Datacenter Performance Analysis of a Tensor Processing Unit," SIGARCH Computer Architecture News, Vol. 45, No. 2, 2017, pp. 1–12. https://doi.org/10.1145/3140659.3080246
[51] Andersson, J., Hjorth, M., Johansson, F., and Habinc, S., "LEON Processor Devices for Space Missions: First 20 Years of LEON in Space," 2017 6th International Conference on Space Mission Challenges for Information Technology (SMC-IT), IEEE Publ., Piscataway, NJ, 2017, pp. 136–141. https://doi.org/10.1109/SMC-IT.2017.31
[52] Lopez, D., Llosa, J., Ayguade, E., and Valero, M., "Impact on Performance of Fused Multiply-Add Units in Aggressive VLIW Architectures," Proceedings of the 1999 International Conference on Parallel Processing, IEEE Publ., Piscataway, NJ, 1999, pp. 22–29.
[53] Asanovic, K., and Wawrzynek, J., Vector Microprocessors, Univ. of California, Berkeley, CA, 1998.
[54] Lee, S.-J., Park, S.-S., and Chung, K.-S., "Efficient SIMD Implementation for Accelerating Convolutional Neural Network," Proceedings of the 4th International Conference on Communication and Information Processing, Assoc. for Computing Machinery, New York, 2018, pp. 174–179. https://doi.org/10.1145/3290420.3290444
[55] Flamand, E., Rossi, D., Conti, F., Loi, I., Pullini, A., Rotenberg, F., and Benini, L., "GAP-8: A RISC-V SoC for AI at the Edge of the IoT," 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP), IEEE Publ., Piscataway, NJ, 2018, pp. 1–4. https://doi.org/10.1109/ASAP.2018.8445101
[56] Peleg, A., and Weiser, U., "MMX Technology Extension to the Intel Architecture," IEEE Micro, Vol. 16, No. 4, 1996, pp. 42–50. https://doi.org/10.1109/40.526924
[57] Thakkur, S., and Huff, T., "Internet Streaming SIMD Extensions," Computer, Vol. 32, No. 12, 1999, pp. 26–34. https://doi.org/10.1109/2.809248
[58] Doolan, D. C., Tabirca, S., and Yang, L. T., "Mobile Parallel Computing," 2006 Fifth International Symposium on Parallel and Distributed Computing, IEEE Publ., Piscataway, NJ, 2006, pp. 161–167. https://doi.org/10.1109/ISPDC.2006.33
[59] Gautschi, M., Schiavone, P. D., Traber, A., Loi, I., Pullini, A., Rossi, D., Flamand, E., Gürkaynak, F. K., and Benini, L., "Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 10, 2017, pp. 2700–2713. https://doi.org/10.1109/TVLSI.2017.2654506
[60] Dabbelt, D., Schmidt, C., Love, E., Mao, H., Karandikar, S., and Asanovic, K., "Vector Processors for Energy-Efficient Embedded Systems," Proceedings of the Third ACM International Workshop on Many-Core Embedded Systems, Assoc. for Computing Machinery, New York, 2016, pp. 10–16. https://doi.org/10.1145/2934495.2934497
[61] Stephens, N., Biles, S., Boettcher, M., Eapen, J., Eyole, M., Gabrielli, G., Horsnell, M., Magklis, G., Martinez, A., Premillieu, N., Reid, A., Rico, A., and Walker, P., "The ARM Scalable Vector Extension," IEEE Micro, Vol. 37, No. 2, 2017, pp. 26–39. https://doi.org/10.1109/MM.2017.35
[62] Shimizu, T., "Post-K Supercomputer with Fujitsu's Original CPU, A64FX Powered by Arm ISA," 2018, https://www.fujitsu.com/global/Images/post-k_supercomputer_with_fujitsu%27s_original_cpu_a64fx_powered_by_arm_isa.pdf.
[63] Lee, Y., Ou, A., Schmidt, C., Karandikar, S., Mao, H., and Asanovic, K., "The Hwacha Microarchitecture Manual, Version 3.8.1," Electrical Engineering and Computer Sciences Dept., Univ. of California TR UCB/EECS-2015-263, Berkeley, CA, 2015, https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-263.html.
[64] "RISC-V 'V' Vector Extension, Version 0.9," 2020, https://github.com/riscv/riscv-v-spec/releases/download/0.9/riscv-v-spec-0.9.pdf [retrieved 2 July 2020].
[65] Chen, C., Xiang, X., Liu, C., Shang, Y., Guo, R., Liu, D., Lu, Y., Hao, Z., Luo, J., Chen, Z., Li, C., Pu, Y., Meng, J., Yan, X., Xie, Y., and Qi, X., "Xuantie-910: A Commercial Multi-Core 12-Stage Pipeline Out-of-Order 64-Bit High Performance RISC-V Processor with Vector Extension: Industrial Product," 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), IEEE Publ., Piscataway, NJ, 2020, pp. 52–64. https://doi.org/10.1109/ISCA45697.2020.00016
[66] Louis, M. S., Azad, Z., Delshadtehrani, L., Gupta, S., Warden, P., Reddi, V. J., and Joshi, A., "Towards Deep Learning Using TensorFlow Lite on RISC-V," Third Workshop on Computer Architecture Research with RISC-V (CARRV), 2019, Paper 7, https://carrv.github.io/2019/papers/carrv2019_paper_7.pdf.
[67] Lee, Y., Waterman, A., Avizienis, R., Cook, H., Sun, C., Stojanović, V., and Asanović, K., "A 45 nm 1.3 GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators," ESSCIRC 2014—40th European Solid State Circuits Conference (ESSCIRC), IEEE Publ., Piscataway, NJ, 2014, pp. 199–202. https://doi.org/10.1109/ESSCIRC.2014.6942056
[68] Mukherjee, S. S., Weaver, C., Emer, J., Reinhardt, S. K., and Austin, T., "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor," Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), IEEE Publ., Piscataway, NJ, 2003, pp. 29–40. https://doi.org/10.1109/MICRO.2003.1253181
[69] Ebrahimi, M., Evans, A., Tahoori, M. B., Costenaro, E., Alexandrescu, D., Chandra, V., and Seyyedi, R., "Comprehensive Analysis of Sequential and Combinational Soft Errors in an Embedded Processor," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 34, No. 10, 2015, pp. 1586–1599. https://doi.org/10.1109/TCAD.2015.2422845
[70] Hubert, G., Artola, L., and Regis, D., "Impact of Scaling on the Soft Error Sensitivity of Bulk, FDSOI and FinFET Technologies due to Atmospheric Radiation," Integration, Vol. 50, June 2015, pp. 39–47. https://doi.org/10.1016/j.vlsi.2015.01.003
[71] Gaisler, J., "A Portable and Fault-Tolerant Microprocessor Based on the SPARC v8 Architecture," Proceedings International Conference on Dependable Systems and Networks, IEEE Publ., Piscataway, NJ, 2002, pp. 409–415. https://doi.org/10.1109/DSN.2002.1028926
[72] "OSCAR OBC," Airbus, 2018, https://www.airbus.com/content/dam/products-and-solutions/space/spacecraft-equipment/sce-datasheets/Publication-sce-oscar.pdf.
[73] Petit, S., David, J. P., Falguere, D., Duzellier, S., Inguimbert, C., Nuns, T., and Ecoffet, R., "Memories Response to MBU and Semi-Empirical Approach for SEE Rate Calculation," IEEE Transactions on Nuclear Science, Vol. 53, No. 4, 2006, pp. 1787–1793. https://doi.org/10.1109/TNS.2006.872153
[74] Samaras, A., Bezerra, F., Lorfevre, E., and Ecoffet, R., "CARMEN-2: In Flight Observation of Nondestructive Single Event Phenomena on Memories," 2011 12th European Conference on Radiation and Its Effects on Components and Systems, IEEE Publ., Piscataway, NJ, 2011, pp. 839–848. https://doi.org/10.1109/RADECS.2011.6131314
[75] Bacchini, A., Furano, G., Rovatti, M., and Ottavi, M., "Total Ionizing Dose Effects on DRAM Data Retention Time," IEEE Transactions on Nuclear Science, Vol. 61, No. 6, 2014, pp. 3690–3693. https://doi.org/10.1109/TNS.2014.2365532
[76] Kumar, A., and Sawitzki, S., "High-Throughput and Low-Power Architectures for Reed Solomon Decoder," Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, IEEE Publ., Piscataway, NJ, 2005, pp. 990–994. https://doi.org/10.1109/ACSSC.2005.1599906
[77] Udipi, A. N., Muralimanohar, N., Chatterjee, N., Balasubramonian, R., Davis, A., and Jouppi, N. P., "Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores," SIGARCH Computer Architecture News, Vol. 38, No. 3, 2010, pp. 175–186. https://doi.org/10.1145/1816038.1815983
[78] Hanho, L., "High-Speed VLSI Architecture for Parallel Reed-Solomon Decoder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 11, No. 2, 2003, pp. 288–294. https://doi.org/10.1109/TVLSI.2003.810782
[79] Shayan, Y. R., and Le-Ngoc, T., "A Cellular Structure for a Versatile Reed-Solomon Decoder," IEEE Transactions on Computers, Vol. 46, No. 1, 1997, pp. 80–85. https://doi.org/10.1109/12.559805
[80] Li, G., Hari, S. K. S., Sullivan, M., Tsai, T., Pattabiraman, K., Emer, J., and Keckler, S. W., "Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications," Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Assoc. for Computing Machinery, New York, 2017. https://doi.org/10.1145/3126908.3126964
[81] Zhang, Z., Huang, L., Huang, R., Xu, W., and Katz, D. S., "Quantifying the Impact of Memory Errors in Deep Learning," 2019 IEEE International Conference on Cluster Computing (CLUSTER), IEEE Publ., Piscataway, NJ, 2019, pp. 1–12. https://doi.org/10.1109/CLUSTER.2019.8890989
[82] Kosinski, B., and Dodson, K., "Key Attributes to Achieving >99.99 Satellite Availability," 2018 IEEE International Reliability Physics Symposium (IRPS), IEEE Publ., Piscataway, NJ, 2018, pp. 6A.3-1–6A.3-10. https://doi.org/10.1109/IRPS.2018.8353620
[83] Gee, J. D., and Smith, A. J., "Vector Processor Caches," Electrical Engineering and Computer Sciences Dept., Univ. of California, TR UCB/CSD-92-707, Berkeley, CA, Oct. 1992, http://www2.eecs.berkeley.edu/Pubs/TechRpts/1992/6251.html.
[84] Yang, Q., "Introducing a New Cache Design into Vector Computers," IEEE Transactions on Computers, Vol. 42, No. 12, 1993, pp. 1411–1424. https://doi.org/10.1109/12.260632
[85] "RISC-V 'V' Vector Extension, Version 0.8," 2019, https://github.com/riscv/riscv-v-spec/releases/download/0.8/riscv-v-spec-0.8.pdf [retrieved 5 Nov. 2020].
[86] Abtahi, T., Shea, C., Kulkarni, A., and Mohsenin, T., "Accelerating Convolutional Neural Network With FFT on Embedded Hardware," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 26, No. 9, 2018, pp. 1737–1749. https://doi.org/10.1109/TVLSI.2018.2825145
[87] Wang, S., Hu, J., and Ziavras, S. G., "On the Characterization of Data Cache Vulnerability in High-Performance Embedded Microprocessors," 2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, IEEE Publ., Piscataway, NJ, 2006, pp. 14–20. https://doi.org/10.1109/ICSAMOS.2006.300803
[88] Sadler, N. N., and Sorin, D. J., "Choosing an Error Protection Scheme for a Microprocessor's L1 Data Cache," 2006 International Conference on Computer Design, IEEE Publ., Piscataway, NJ, 2006, pp. 499–505. https://doi.org/10.1109/ICCD.2006.4380862
[89] Fernández, M., Gioiosa, R., Quiñones, E., Fossati, L., Zulianello, M., and Cazorla, F. J., "Assessing the Suitability of the NGMP Multi-Core Processor in the Space Domain," Proceedings of the Tenth ACM International Conference on Embedded Software, Assoc. for Computing Machinery, New York, 2012, pp. 175–184. https://doi.org/10.1145/2380356.2380389

Z. Sunberg
Associate Editor