Di Mascio Et Al 2021 On Board Decision Making in Space With Deep Neural Networks and Risc V Vector Processors
Di Mascio Et Al 2021 On Board Decision Making in Space With Deep Neural Networks and Risc V Vector Processors
readily available, is more difficult and thus still rare. This paper analyzes the impact of DNNs on the system-level
capabilities of space systems in terms of on-board decision making (OBDM) and identifies the specific criticalities of
deploying DNNs on satellites. The workload of DNNs for on-board image and telemetry analysis is analyzed, and the
results are used to drive the preliminary design of a RISC-V vector processor to be employed as a generic platform to
enable energy-efficient OBDM for both payload and platform applications. The design of the memory subsystem is
carried out in detail to allow full exploitation of the computational resources in typically resource-constrained space
systems.
during the Very High Speed Integrated Circuit Hardware Description small satellite in LEO to generate data and its capability to transmit
Language (VHDL) implementation of a RISC-V vector processor for data to the ground.
space applications based on the NOEL-V platform (developed by
Cobham Gaisler) [14]. 1. Benefits of Data Removal and Compression
RISC-V is an instruction set architecture (ISA) that is rapidly
growing in popularity in both terrestrial and space applications [4]. Given the tight power budgets and the expensive hardware
Its main characteristics are simplicity, openness (being a free and required for on-board data processing, data processing is typically
open standard allows open-source implementations), and modularity executed on ground. For instance, noise filtering can be executed on
(i.e., composed of a base ISA and many optional ISA extensions). ground with cheaper hardware. On the other hand, sometimes on-
Among the many ISA extensions defined in the standard, the RISC-V board data processing provides an advantage over on-ground data
Vector Extension (RVVE) is being proposed to provide general processing in terms of satellite performance. For instance, data
support for data-parallel execution [15]. compression is already deployed in many missions (e.g., in [22] a
The paper starts by analyzing the benefits that DNNs can provide at 2:1 compression is employed) because it mitigates the bottleneck of
the system level and the feasibility of the deep learning approach for the downlink. The efficiency of the downlink can be increased even
space applications (Sec. II). Then, an analysis of the software work- further, removing useless data instead of sending it to the ground
loads required for DNNs is carried out in Sec. III. The information (i.e., data removal [23]). For instance, in the Landsat datasets [24], the
collected is then used to define a suitable hardware platform (Secs. IV average cloud cover in an archived scene is 34%, with 38% of the
and V). To account for both computational and memory constraints, scenes containing less than 10% cloud cover. Therefore, selecting
separate discussions are carried out for the microarchitecture of the only images with less than 10% of cloud cover results in an average of
processing core (Sec. IV) and its memory subsystem (Sec. V). 2.63× data reduction. Combining data removal with a 2:1 compres-
Finally, Sec. VI concludes with a summary of the main findings sion, the amount of useful data sent increases by 5.26× compared
and several recommendations to systematically enable OBDM with with a system without on-board data processing.
RISC-V vector processors in the medium-term.
2. Cost of Required Hardware
When DNNs and other data processing algorithms are to be
II. Impact at System Level deployed on data produced by instruments, a payload processor is
The focus of the space industry in recent years shifted from large required to process the data. Although memories with long retention
geostationary orbit (GEO) satellites to small (< 500 kg) low-Earth- time and low power dissipation (e.g., flash memories) can be
orbit (LEO) satellites (especially CubeSats) [16,17]. employed for mass memories, faster memories are required to act
While GEO satellites can continuously communicate with the as main memory of the payload processors. Typically dynamic
ground station, LEO satellites can only communicate with the ground random-access memory (DRAM) arrays are chosen, ranging from
station periodically, sometimes with large periods between contacts single data rate (SDR) to double data rate 2 (DDR2) to double data
[18]. In this way, the satellite may enter an unsafe state and the ground rate 3 (DDR3), depending on the radiation resilience/performance
operator in the worst case can only intervene hours later. tradeoff required [25]. From the datasheet [26] of the 1 Gb DDR2
However, there is a trend of launching LEO satellites in constella- DRAM tested in [25], a peak power consumption of around 0.5 W
tions and mega constellations [19], with the possibility of mitigating can be taken as an estimation of power consumption, and 1 W for the
the risk of failure of a single satellite and replace them if they fail most powerful version of the vector processor in [27] running a
(as they are much cheaper than large GEO satellites). There is there- peak-performance application. Assuming a requirement of 1 GiB
fore a tradeoff to be made between dependability of a single satellite, of main memory, we consider 5 W as the cost in terms of power PP of
its cost, and number of spare satellites. applying data reduction and compression. As a comparison, 1U
Furthermore, space systems are inherently constrained in terms of CubeSats and 3U CubeSats in [20] generate, respectively, 1–2 W
power available (e.g., only a limited surface is available to collect and 5–6 W, whereas the 6U CubeSat in [22] generates around 20 W.
power). Limited power implies that the data rate of the downlink Assuming a common amount of power allocated for the trans-
given a certain target bit error rate (BER) is also limited, as the data mission and data processing subsystems (PTP ), we can estimate
rate is proportional to the power employed during the transmis- the amount of useful data transmitted per station contact DC when
sion [20]. data are not processed as DC PTP ∕RR k, where k is a
Therefore, small satellites in LEO pose new challenges both in constant (dependent on the transmission subsystem, receiver,
terms of amount of data that can be transmitted to the ground and in propagation, and required BER) [20] and RR is the optimal
terms of dependability. In the following two subsections we will show removal rate, i.e., the ratio between useful data and data produced
how OBDM can help mitigating these shortcomings of LEO satel- by the payload. When only useful data are selected and a data
lites. In Sec. II.C the feasibility of applying DNNs to these problems compression of CR:1 is applied, the amount of useful data trans-
is investigated. mitted is instead DC CR PTP − PP k. The ratio R between
DI MASCIO ET AL. 555
B. Virtual Operator
In [21] it is assumed that a LEO satellite has an orbit duration of
90 min and that there is a contact with the ground station either 5 min
every orbit (6% of the orbital period) or every 5 orbits (1%). In a
similar scenario, the idea of an on-board virtual operator monitoring
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
the past telemetry to train the network on ground and then uplink the public code††† of CloudNet [34]. It is a fully convolutional network
trained network in software. When the telemetry forecasting is to be (FCN) [35] for cloud detection; i.e., its output is a mask of the same
deployed on a constellation composed of replicas of the same size of the input image indicating the pixels covered with clouds. The
satellite, more statistics for larger datasets are available. As reported use of an FCN instead of a CNN helps in mapping efficiently the
in [19], existing and planned constellations comprise hundreds to DNN in resource-constrained hardware, as it is possible to work on
thousands satellites (e.g., 4200 for the planned constellation from patches of a large image without the need of working on the entire
Samsung), thus making DNNs potentially very effective also for image. The fraction of bits covered in clouds can then be averaged on
mission-specific parameters. the ∼400 patches. In the case of CloudNet four spectral bands of the
large images of Landsat 8 (e.g., 7621 × 7791 pixels) are divided in
nonoverlapping patches of 384 × 384 pixels, which are then down-
III. Workloads Analysis sampled to 192 × 192 pixels.
Analyzing the model in Keras,‡‡‡ we find that CloudNet contains
The run time of compute-intensive workloads composed of a 38 two-dimensional convolutional layers (of which 5 are transposed),
certain amount of floating point calculations is typically expressed 15 addition layers, 31 batch normalization layers, 45 standalone
in terms of number of floating point operations (FLOPs) per second activation layers, and 53 concatenate layers. To give an idea of the
(FLOP/s) or number of FLOPs per clock cycle (FLOP/CC).*** The contribution of each of these layers, we profiled the execution of the
number of FLOP/CC that can be achieved by a certain hardware model on a quad-core Intel i7-6600U. The breakdown of the execu-
platform has an upper bound defined by the number of functional tion type for each type of layer is shown in Fig. 3, and considerations
units and the amount of operations these units can perform simulta- on each of them are carried out in the remainder of this section.
neously. We call this upper bound maximum theoretical performance Furthermore, running a single inference per time requires a peak
per clock cycle (MTPCC ). MTPCC is independent of any other memory of 836.65 MiB. This value is compatible with values found
microarchitectural feature, like instruction-level parallelism (ILP), in literature for other DNNs, typically ranging from 645 MiB to 1.49
speculation, and caching. However, it is not possible to achieve GiB [36].
#FLOP∕CC ≈ MTPCC for every workload, as data are to be fetched
from memory, and in some cases this cannot be done fast enough to
1. Convolutional Layers
keep the functional units busy all the time. To visualize whether a
workload can achieve the MTPCC (compute-bounded workloads) or As shown in Fig. 4, applying a convolutional layer with N kernels,
the performances are bound from the memory bandwidth (memory- each of dimensions C × J × K, kernels to an input of dimensions C ×
bounded workloads), the roofline model was introduced in [32]. W × H generates an output of N matrices, each of dimensions U × V
According to this model, the fraction of MTPCC that can be achieved [37], with U and V depending on the stride S and padding P of the
by a workload on a certain platform depends on the operational convolutional layer with the equation (an analogous relationship
intensity (OI) of the workload, which is holds replacing W, J, and U with, respectively, H, K, and V) [38]:
where MT is the memory traffic composed of the read traffic RT plus As straightforward software implementations of convolutions
the write traffic WT. For each hardware platform there is an OI for achieve low performance, performances are typically improved
which workloads are memory-bounded if OI < OI (therefore the unrolling the convolutions into matrix–matrix multiplications [39].
performances are limited to #FLOP∕CC BW OI, where BW is In this case, the number of FLOPs for each layer is estimated as
the bandwidth of the memory) and compute-bounded if OI > OI #FLOP 2UVNCJK, given that there are UVN output elements
(where achieving the MTPCC is actually possible with microarchi- and for each of them CJK multiplications and accumulations are
tecture and software optimizations). Although based on several required. The read traffic is then RT 4NCJK UVCJK and the
assumptions, for instance, that it is possible to overlap memory write traffic is WT 4NUV. In Table 1 we show the size of the unroll
transfers and computations [33]), the roofline model is a successful of the convolution for only the first 15 layers (for sake of brevity) of
tool to benchmark processors in an application-independent way, the network. Some observations can be made:
mainly focusing on the performance of popular kernels (e.g., [27]). 1) OIs are large (in the order of tens of FLOP/B), except for
convolutions with K 1 for which OI can go down to 1.60 FLOP/B.
¶¶
https://goce-ds.eo.esa.int/oads/access/collection/GOCE_Telemetry/.
†††
***Normalizing by frequency is a common procedure to obtain technol- https://github.com/SorourMo/Cloud-Net-A-semantic-segmentation-
ogy-independent metrics that measure the effectiveness of a certain micro- CNN-for-cloud-detection.
‡‡‡
architecture. https://keras.io.
DI MASCIO ET AL. 557
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
Convolution 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
WH 192 192 192 192 96 96 96 48 48 48 24 24 24 12 12
C 4 16 16 32 32 32 64 64 64 128 128 128 256 256 512
KJ 3 3 3 1 3 3 1 3 3 1 3 3 1 3 3
N 16 32 16 32 64 32 64 128 64 128 256 128 256 512 512
UV 192 192 192 192 96 96 96 48 48 48 24 24 24 12 12
RT [MiB] 5.06 20.27 20.26 40.54 10.20 10.16 20.39 5.34 5.20 10.69 3.66 3.09 7.31 5.77 11.53
WT [MiB] 2.25 4.50 2.25 4.50 2.25 1.13 2.25 1.13 0.56 1.13 0.56 0.28 0.56 0.28 0.28
MT [MiB] 7.31 24.77 22.51 45.04 12.45 11.29 22.64 6.47 5.77 11.81 4.22 3.38 7.88 6.05 11.81
#MFLOP 42.5 340 170 75.5 340 170 75.5 340 170 75.5 340 170 75.5 340 679
OI [FLOP/B] 5.54 13.08 7.20 1.60 26.03 14.36 3.18 50.09 28.10 6.10 76.80 48.00 9.14 53.58 54.86
2) Even if OI is large and therefore the workloads can be assumed n21 1 2n2
to be compute-bounded, the absolute amount of memory traffic is OI
8n21 n1 n2
very high (3–45 MiB per layer). These values require a dedicated
design of the memory subsystem compared with processors for non-
compute-intensive workloads, which will be carried out in Sec. V. Assuming that 2n2 ≫ 1, OI ≈ n1 n2 ∕4n1 n2 ,which given a
3) The memory traffic is for a large majority composed by reads certain memory traffic (i.e., n1 n2 const) is maximized for
(92.83% in average). n1 n2 , reaching OI ≈ n1 ∕8. As OI is proportional to the size of
Further performance enhancements can be obtained by mapping the output matrix, SGEMM will eventually achieve the peak perfor-
the matrix–matrix multiplication with optimized libraries. In [39] it is mance for a large enough matrix on a given hardware platform. For
shown that using basic linear algebra subroutines (BLAS) instead of this reason, the SGEMM efficiency
coding the unrolled version from scratch produces a speed-up rang-
ing from 2.43× to 3× depending on the architecture and on the input FLOP∕CC
ESGEMM
size. Using BLAS subroutines, matrix–matrix multiplications are MTPCC
mapped to the SGEMM subroutine,§§§ which (in its nontransposed
form) implements the following algorithm: (i.e., the fraction of time the functional units of the processor are busy
when executing SGEMM) is typically given as a measure of attain-
able performance on a certain hardware platform [40]. When caching
A2←αA0 × A1 βA2 (3) levels are present, increasing the size of the matrix multiplications to
increase OI will eventually cause a drop in performance, as the
operands will not fit anymore in the cache level responsible of peak
performance and reads from lower levels (even main memory) are
where A0, A1, and A2 are matrices of, respectively, size n1 × n2 ,
required during the matrix multiplication, breaking the assumption of
n2 × n3 , and n1 × n3 , and α and β are scalars. Assuming α β 1
the roofline model that memory traffic and computation overlap. This
(as in the case of convolutions) and a square matrix at the output
issue is analyzed in Sec. V.
(n1 n3 ), SGEMM has
§§§
2. Concatenate and Addition
Analogous subroutines are defined for different data types, and the first
letter represents the data type. For instance, SGEMM is for single precision Given that CloudNet is very deep (38 convolutional layers), it
(SP), DGEMM is for double precision (DP), and IGEMM is for integers. In requires specific solutions in its architecture to mitigate the vanishing
this paper, data will be assumed to be SP unless specified otherwise; therefore gradient problem [41]. The designers of CloudNet handled this
SGEMM will be used. problem using skip connections, and addition and concatenation
558 DI MASCIO ET AL.
layers [34]. As can be seen in Fig. 3, although the impact of addition Therefore, OI reaches its maximum (0.5 FLOP/B) for very large
layers on the execution time is negligible (1.1%), concatenation CHW and N. To give an idea of how FC layers compare against
layers take a considerable part of the execution time (24.7%). Fur- convolutional layers, we compared the memory traffic and OI for the
thermore, concatenate operations contain no FLOPs and consist convolutional layers in Table 1 to FC layers with same C, W, H, and
mainly of memory transfers; therefore they cannot be sped up with N. The MT of the FC layers ranges between 1.31× and 18.36×
increased computation capabilities. These considerations suggest compared with the respective convolutional layer, whereas the OI
avoiding architectures concatenation and using skip connection is 3.3× to 154.2× smaller. The high MT associated with FC layers is
between layers with same dimensions (where concatenations are confirmed by [37], although a trend can be noticed: for early CNNs
not needed), as done in [41]. with few convolutional layers (e.g., AlexNet with five convolutional
layers and three FC layers) the percentage of parameters in the FC
3. Batch Normalization layers is very high (for AlexNet 96.07%), whereas state-of-the-art
Batch normalization layers are employed to speed up training and deeper CNNs (typically achieving higher accuracy) like ResNet have
increase accuracy of DNNs [42]. This type of layer also acts as a many convolutional layers (for ResNet the number of convolutional
regularizer, keeping the magnitude of the parameters low and avoid- layers ranges from 53 to 155 and typically only one FC layer is
ing overfitting [42,43]. When using batch normalization during present) and have a much lower percentage of parameters in the FC
inference, each element xi of the activation vector x from the previous layers (ranging, respectively, from 8.04 to 3.42%).
layers is normalized according to Furthermore, the performance for FC layers can be improved
employing batching, i.e., processing more input features in paral-
xi − Exi lel [37]. This technique is particularly effective in the case of
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
B. Other Layers in DNNs for Image Analysis C. DNNs for Telemetry Forecasting
When DNNs are employed for classification, the expected output Recurrent neural networks (RNNs) are typically employed in time
is typically a vector containing the probability of classification for a series analysis like speech recognition and natural language process-
certain object. In these cases, the last layers of the DNN after the ing (NLP) [47–49], and they can be applied, for instance, to early
convolutional layers are composed of fully connected (FC) layers to failure detection or to predict the telemetry of the next orbit given the
make a decision based on the information contained in groups of telemetry of previous orbits, as done in [28]. RNNs are composed by
pixels. This type of network is usually called CNN. FC layers can be a cascade of units with internal feedback, where each unit requires the
seen as convolutional layers where there is no sharing of coefficients, output of the previous one to be ready to calculate the next activation.
i.e., J K W H [37]. This implies that the output is a vector of Typically long short-term memory (LSTM) implementations are
size N, the number of operations is #FLOPs 2NCHW, the chosen to achieve higher accuracy, whereas gated recurrent unit
memory traffic is MT 4CHWN 1 N, and (GRU) implementations provide lower accuracy with higher
1
OI ¶¶¶
Batching is instead not effective with convolutions, as the amount of
21 1∕N 1∕CHW parameters in a convolution is very small (e.g., 3 × 3 × 16).
DI MASCIO ET AL. 559
a) No FMA, D=1, P=1 b) FMA, D=1, P=1 c) FMA, D=4, P=1 d) FMA, D=4, P=4
Fig. 5 Steps to increase the MTPCC of space processors in a power- and area-efficient way.
performance [28]. Furthermore, one or more FC layers are placed power consumption of a general-purpose scalar processor is spent on
before the output [28]. fetching instructions. For instance, the breakdown of energy dissi-
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
LSTM layers are typically memory-bounded [50]. Similarly to pation on a scalar processor executing IGEMM in [5] shows that the
[28], the linear part of the LSTM layer can be described as instruction cache dissipates 19.63% of the total energy, instruction
fetch and decode stages 4.69%, and the virtual memory (comprising
st W ⋅ xt U ⋅ ht−1 b (5) both instruction and data) 7.41%. A percentage of energy dissipation
ranging between around 24 and 32% can therefore be attributed to
where xt , ht−1 , and st are column vectors, respectively, of length m, n, instructions fetching and decoding. Data parallel processors reduce
and n. W and U are, respectively, [m × n] and [n × n]. Therefore the this fraction of power, defining instructions that operate on arrays of
#FLOPs is 2n2 n nm, the MT seen by main memory is D elements instead of scalar elements. Figure 5c shows an example
4n2 nm 3n m, and the OI is with D 4, which (together with FMA operations) achieves
MTPCC 8 FLOP/CC. However, DLP is the least flexible form of
#FLOP parallelism [53], as it can only be applied to calculations that can be
OI (6)
2#FLOP 2n m vectorized (i.e., expressed with instructions on vectors), e.g., matrix–
matrix multiplications in convolutional layers. As a matter of fact, in
with a maximum value of 0.5 FLOP/B for large matrices, i.e., [54], the speed-up found in the convolutional layers of a CNN using
#FLOP ≫ 2n m. This low value can be increased with batching, the data-parallel NEON extension over the baseline ARM ranges
as it turns matrix–vector multiplications into more computational from 2.45× to 2.78×, with a decrease of energy consumption per
intensive matrix–matrix multiplications (as B vectors are put together convolutional layer ranging from 59.11 to 82.04%. The energy
to create a matrix of dimensions [n × B]). In this case, efficiency of the data-parallel solution (i.e., performance in terms
of executed layers per amount of energy) is in this case then 5.98× to
#FLOP B 15.50× the energy efficiency of the non-data-parallel baseline. When
OI (7) the effectiveness of DLP saturates for large D, the solution left to
2#FLOP 3B − 1n m B
increase the MTPCC is to replicate the processing core. In Fig. 5d the
This equation shows that the efficacy of batching in terms core is replicated four times (P 4), achieving an MTPCC of 32
of increase of OI saturates as B grows, until the upper bound FLOP/CC (together with FMA and D 4). Going above four cores
of OImax #FLOP∕6n 2m is achieved. This upper bound is a typically reduces the utilization of the functional units. For instance,
relatively large value, for instance, 27.29 for m 27 and n 60 in [55] it is shown that with eight cores it is possible to obtain for
(typical values in [28]). However, OI cannot be increased arbitrarily CNNs’ performances ranging from 3.99× to 5.76× the performance
by batching in real-time applications, as batching requires that all the of a single core. Similarly, with eight cores it is possible to reach
inputs to the LSTM layers of the batch are ready. For instance, in [50] 5.55× the performance of a single-core implementation of an LSTM
increasing batching from 16 to 64 increases performance to 2.41× the RNN [28].
original value, whereas the time required to complete execution in
more than 99% of the cases increases from 7.2 to 21.3 ms (2.95×). A. Data-Parallel Processors
When compute-intensive applications were to be addressed in the
commercial market, computer architects resorted to packed single
IV. RISC-V Vector Processors
instruction multiple data (SIMD) ISAs with the Intel’s MMX exten-
State-of-the-art processors for space applications typically execute sions (1996) for integers [56] and the SSE extensions (1999) for floats
instructions on two scalar operands [51]. Considering a single core, [57]. The success of ARM in high-end embedded applications made
this type of platform has an MTPCC of 1 FLOP/CC. the SIMD NEON extension, first introduced in the ARMv7-A Cor-
The simplest way of increasing the MTPCC of future space pro- tex-A8 (2005) [58], very popular. Also PULP, one of the most
cessors is to introduce ISA extensions with instructions defining popular sets of RISC-V cores, employs the RI5CY packed SIMD
fused multiply-add (z←wx y) and fused multiply-accumulate extension (2016) defined outside of the RISC-V standard [59].
(z←xy z) operations,**** achieving an MTPCC of 2 FLOP/CC Packed SIMD extensions are typically chosen by hardware design-
(as shown in Fig. 5). This requires modifications to the floating point ers because they can be applied to scalar processors without extensive
unit (FPU) and arithmetic-logic unit (ALU). However, the cost of modifications to the microarchitecture [60]. However, the end of
these changes in the FPUs and ALUs is limited (as, for instance, the Moore’s law is leading computer architects to use more efficient
area of these units is dominated by the multiplier). The biggest cost is ISA extensions, and ARM recently (2017) released its ARMv8-A
instead on the complexity of the register file, which is required to Scalable Vector Extension (SVE) [61]. Although previous Fujitsu’s
provide more operators to the functional units [52]. supercomputers were based on SIMD extensions of SPARC, the
To increase the MTPCC even further, DLP is the most energy- Fujitsu A64FX is the first processor based on the ARMv8-A SVE,
efficient solution available [53]. As a matter of fact, large part of the targeting supercomputer applications. It achieves 2.7 DP-TFLOPS
(7 nm process), a DGEMM efficiency >90% [62] and it is composed
****Both will be indicated with FMA, unless a distinction is to be done. by 48 computing cores, each achieving around 57 DP-GFLOPS [62].
560 DI MASCIO ET AL.
Vector extensions are already known to be more efficient than 2. Microarchitecture of Vector Processors
packed-SIMD, as they can be seen as more flexible versions of There are two main approaches to design a vector processor. Vector
packed-SIMD thanks to their time-multiplexed and vector length- processors for supercomputers, like the Fujitsu AF64X, typically
agnostic approach (the software is oblivious to the hardware vector have a joint scalar and vector pipeline with separated register files and
length of a specific implementation and the same code executes using execution units. The main disadvantage of this approach is that a
the largest parallelism available) [27,60,61]. In SIMD extensions vector load instruction stalls the pipeline also for scalar instructions,
instead, the data width of the operations is encoded in the instruction unless a superscalar pipeline with large ILP is employed (e.g., this is
opcodes. When the architects of such ISAs wish to increase perfor- done in the Fujitsu AF64X with up to four ways). When the ILP is not
mance by widening the vectors, they must add a new set of instruc- high enough, using a decoupled vector pipeline, where the scalar
tions to process these vectors. For instance, Intel’s newest AVX pipeline pushes vector instructions into an instruction queue inter-
instructions are as long as 11 bytes [60]. Furthermore, application facing the vector pipeline, can mitigate this issue. The scalar pipeline
code compiled for previous versions cannot automatically leverage can continue execution and the vector pipeline acknowledges com-
the widened vectors of new implementations. At the same time, code pletion of vector instructions and passes scalar results (when needed)
compiled for wider SIMD registers fails to execute on older machines to the scalar pipeline without passing through the bus. This approach
as the new instructions are not known to older implementations. is employed, for instance, for the Ara processor [27] and it is shown in
Furthermore, in SIMD extra code is needed to handle up to three
Fig. 6. Another advantage of this approach is that it provides a more
fringe elements of stripe mine loops [60].
modular solution and a vector version of a RISC-V processor can be
For these reasons, the proposal for packed-SIMD floating-point
achieved with minimal modifications to the scalar design (i.e., the
was dropped in favor of the Vextension for large floating-point vector
introduction of a front end).
operations [15]. However, there was interest in packed-SIMD fixed-
point operations for use in the integer registers of small RISC-V The critical elements of a vector processor are shown in Fig. 6. The
implementations. A task group is working to define the packed- following subsections will focus on the vector register file (Sec. IV.B),
SIMD P extension [15]. and on the issues limiting scalability of performance (Sec. IV.C).
Furthermore, Sec. IV.D provides insights on the soft error vulner-
1. RISC-V Vector Extension
ability of vector processors.
The RISC-V Vector Extension (RVVE) is similar to the ARMv8-A
B. Vector Register File
SVE and was heavily inspired by the Hwacha†††† development [63].
Both RVVE and ARMv8-A SVE define a configurable vector unit Vector register files (VRFs) are typically more complex than
with 32 vector registers (i.e., given a certain VRF size, the number of register files (RFs), as they have in general more contention given
elements and size of elements can be configured with instructions) FMA operations and masked execution‡‡‡‡ [27]. When considering
[15] and allow the same binary code to work efficiently across Ara, the worst case for contention for access to the VRF is the masked
a variety of hardware implementations, varying in physical vector FMA (multiply-add) instruction, which reads four operands from
storage capacity and data path parallelism. Additionally, ARMv8-A four vector registers (one mask, two factors, and one addend) [27],
SVE includes 16 scalable predicate registers (not defined in the executes the operation only if the mask has a certain value, and writes
baseline RVVE [64]) to optimize loops, using the predicate con- to a register the result of the operation. A straightforward solution to
trolled loops vectorization style [61]. avoid contention in the VRF is therefore to employ a multiported
Although the RVVE is still in the process of being standardized, it static random-access memory (SRAM) with as many ports as needed,
plays such a crucial role in state-of-the-art applications that already in this case four read ports and one write port (4R1W). However,
several developments implementing the RVVE are described in multiported register files come with a large area overhead. In [53], the
literature. The two most notable examples are the Xuantie-910, a area of the VRF for the T0 vector processor according to the different
12 nm RISC-V processor with 16 cores clocked up to 2.5 GHz number of ports employed is analyzed. As the T0 vector processor
with an out-of-order triple-issue 12-stage pipeline [65], and Ara, contains two arithmetic units and one multiplier per lane, to avoid
a RISC-V vector processor based on Ariane achieving up to 33 contention it requires one read port and one write port for the
GFLOP/s and 41 GFLOP/J on 22 nm fully-depleted silicon-on- multiplication, and two ports for read and one for write for each
insulator (FD-SOI) technology. Furthermore, work is being done to arithmetic unit (i.e., 5R3W). Different implementations in ASIC
support the RVVE in popular DNN frameworks like TensorFlow technology are proposed for the VRF, trading-off the number of
Lite [66]. banks and ports: one 5R3W bank of 256 elements (1×5R3W), two
††††
The main difference with RISC-V Vector extension is that Hwacha
‡‡‡‡
fetches its own instructions, as there are two threads: a control thread running RVVE provides for many instructions a field that specifies whether the
on the scalar core and a worker thread [60]. This can potentially lead to higher instruction is to be executed or not according to the value of a bit in a specific
performance, but also higher complexity. vector register [64].
DI MASCIO ET AL. 561
3R2W banks of 128 elements each (2×3R2W), and four 2R1W banks Table 2 Scalability of Ara in terms of number of lanes (peak values
of 64 elements each (4×2R1W). in bold) for 22FDX process (FD-SOI) (data derived from [27])
Data from [53] show that banking decreases the area occupied by Number of lanes
the VRF by 31.7% when going from 1×5R3W to 2×3R2W. How-
Performance metric 2 4 8 16
ever, the efficacy of this technique saturates quickly, as going from
2×3R2W to 4×2R1W decreases the area only by 2.1%. This is due to Max. frequency [normalized] 1.00 1.00 0.94 0.83
Max. FPU utilization [%] 98.20 98.00 97.22 97.36
the increase of overhead to handle the banks (storage cells compose
Area efficiency [DP-kFLOP/s/GE] 2.20 2.85 3.08 3.02
88.9% of the VRF for 1×5R3W, 83.1% for 2×3R2W, and only 41.7% Energy efficiency [DP-GFLOP/mJ] 35.58 37.84 39.91 40.81
for the 4×2R1W implementation).
Banking is also employed in Ara, where the VRF is composed of
eight single-ported read-or-write banks (1RW). To help avoid con-
tention, in Ara vectors are organized in SRAM banks with a shift of logic to handle the increased number of lanes. Therefore, area effi-
one element (“barber pole” shift) [27]. This is particular effective to ciency can be expected to be more critical than energy efficiency in
avoid conflicts when the functional units fetch the first elements of vector processors. Ariane and Ara occupy together between 2228 and
two vectors [27]. However, this organization leaves some residual
10,735 kGE. In particular, the area of Ariane and Ara with four lanes
contention, which is addressed with a round-robin with two priority
is 3434 kGE. i.e., 4.28 times a single-core Ariane comprising level 1
levels [27]. A way to completely solve bank contention is systolic
(L1) caches. Therefore a four-lane vector processor has similar
execution. For instance, Hwacha uses four 1R1W (4×1R1W) dual
requirements in terms of die area compared with state-of-the-art
port banks with stall-free systolic bank execution, capable to sustain
quad-core processor for space [51].
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
C. Scalability
Along with the memory bound identified by the roofline model,
the authors of Ara [27] show that the limited issue rate of instructions
Although existing RISC-V vector processors have good scalability for a single-issue scalar pipeline limits the performance for matrices
in terms of peak performance and efficiency (as can be seen in of sizes smaller than 256 × 256. Therefore, they suggest that the use
Table 2), there are still criticalities to be addressed for small matrices of higher ILP and speculation in the scalar pipeline could improve
and very high requirements of peak performance. The remainder of performance for smaller matrices, where control operations
this subsection discusses how scalability influences frequency, effi- (e.g., configuration of the lanes) have a larger overhead. Similarly
ciency, the effects of the issue rate on the achieved performance and to [27] for an n × n matrix multiplication with SP parameters, an
the width of the interconnect. upper bound due to the issue rate #FLOP 16 OI∕ΔCCissue can be
found, and OI MTPCC ΔCCissue ∕16 due to the issue rate. This
1. Frequency
equation shows that doubling the issue rate (i.e., using a dual-issue
Most considerations in previous sections were based on the fre- microarchitecture) will halve the OI . For instance, as an FMA instruc-
quency-normalized value FLOP/CC, whereas a reduction of clock tion can be issued every five clock cycles (CCs) in Ara, the worst OI is
frequency decreases the peak performance in terms of FLOP/s 5 FLOP/B (8 lanes version with MTPCC 16 FLOP/CC), whereas a
(as #FLOP∕s fCPU #FLOP∕CC) and therefore can decrease dual-issue version lowers this value to 2.5 FLOP/B. As can be seen in
the efficiency of a platform with increased DLP. Sec. V, these values are comparable with upper bounds due to memory
In [27] Ara has been implemented in Global Foundries 22FDX bandwidth and therefore can have an impact on performance when they
process (FD-SOI). As can be seen in Table 2, the two-lane and four- produce an higher OI than memory bandwidth.
lane versions of Ara achieve the same maximum nominal frequency.
In both cases, the critical path is in the DP FMA FPU (1.2 GHz 4. Interconnect
nominal, 0.92 GHz worst case), about 40 gate delays long. Another
critical path (of the same length) is present in the combinational To increase the OI due to the memory bandwidth, Ara uses a single
handshake between the Vector Load and Store Unit (VLSU) and 32 N L -wide bus interface for all the lanes together,§§§§ reaching
operand queues in the lanes of the vector processor. When increasing 512 bits for 16 lanes. To keep the same value of 2 B/DP-FLOP, a
the number of lanes, the second path becomes longer, and therefore 32-lane implementation would need a 1024-bit-wide bus interface.
the frequency is reduced (down to 1.04 GHz for 16 lanes). This is However, this problem can be mitigated using an L1 cache for vector
because the VLSU handles data to and from all the lanes simulta- data (L1V), which allows large bandwidth for data residing in it without
neously. Therefore, a larger number of lanes imply longer combina- requiring a wide crossbar (Fig. 7). The design of an area efficient
tional paths. This shows that, in general, the scalability of the DLP in memory subsystem for RISC-V vector processors is described in Sec. V.
a vector processor is limited by the elements that act on all the
lanes [27]. D. Soft Error Vulnerability
It should be noted that the maximum frequency of the scalar Vector processors typically achieve high utilization of the FPU
processor on the same technology is 1.7 GHz [5]. Therefore, the (e.g., 97% in [27]), whereas scalar processors typically work in
two-lane version already comes with a penalty of at least 30% memory-bounded conditions and therefore achieve much lower
compared with the scalar processor. FPU utilization. This implies an increase of soft error vulnerability
of arithmetic units, as suggested by the models in [68] relating
2. Area and Energy Efficiency utilization and soft error vulnerability. Furthermore, the increase of
The increasing energy efficiency in Table 2 shows good scalability frequency compared with state-of-the-art processors for space
and suggests that the peak in energy efficiency may be obtained for an (e.g., from 250 MHz to 1 GHz) points to an increased percentage
even larger number of lanes. On 22 nm FD-SOI, Ariane and Ara of errors from combinational logic (as shown in [69]), which com-
(depending on the number of lanes) consume between 138 (2 lanes) pose the majority of the area in FPUs and ALUs. For instance, we
and 794 mW (16 lanes) at peak performance [27]. As energy effi- synthesized the BOOM processor¶¶¶¶ on a 65 nm ASIC technology
ciency depends on the ASIC technology employed, changing tech- and the area of the FPU and ALUs (comprising hardware multipli-
nology will provide different efficiency. Resorting to a 65 nm RHBD cation and division) results composed, respectively, for 79.52 and
technology would decrease energy efficiency because of larger 86.11% of combinational logic. Finally, scaling efficiently at least up
power consumption for a given clock frequency. to 16 lanes, a vector processor can achieve high performance when
Area efficiency reaches a maximum for 8 lanes, as for 16 lanes the
increase due to the decreased overhead of the scalar pipeline per §§§§
Hwacha, instead, uses an interface per lane [67].
¶¶¶¶
vector lane is more than compensated by the greater complexity of the https://github.com/riscv-boom/boom-template.git.
562 DI MASCIO ET AL.
Fig. 7 Possible memory hierarchy for a vector processor. Other cores and peripherals (not shown in figure) can be connected to the interconnect.
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
Fig. 8 Theoretical improvement for low OI workloads for matrices residing in L2 and L1V compared with SDR and DDR (single chip).
large ASIC implementations are possible. For this reason, small is added to increase performance especially for workloads with low
technology nodes should be preferred. However, in [70] it is reported OI. The figure also indicates the width W i of the interface between
that going below 28 nm increases the soft error rate (SER) in the levels, which determines the bandwidth Bi of the interface together
terrestrial environment. In FD-SOI technologies this is mainly due to with its clock frequency fclki , according to Bi fclki W i . For
an increase of SER due to protons, whereas the SER due to alpha instance, the Sandy Bridge in [33] has a 384-bit interface and a
particles is slightly decreasing. Given that in space there is a different maximum bandwidth of 384 b/CC. In the case of DRAMs, BD is
radiation environment, the technology node minimizing the SER given by RD CD fclkD W D, where RD is 1 for SDR and 2 for
may be different. DDR, CD is the number of channels for the DRAM, fclk the clock
The separation between scalar and vector pipeline in decoupled frequency, and W the word size. For the DRAM employed in the
vector processors allows for a selective hardening approach. Sandy Bridge in [33] CD 2, fclkD 0.8 GHz, and W D 64, and
Assuming that control operations are executed only in the scalar therefore BD is 25.6 GB/s.
pipeline and computations only in the vector pipeline, redundancy A cache-aware roofline model [33], shown in Fig. 8, highlights the
to avoid catastrophic failures is required only in the scalar pipeline. main benefits of adopting a memory hierarchy similar to Fig. 7. When
In Ara, the critical path limiting the maximum frequency for the data reside in main memory, OI is around 2.50–6.02 FLOP/B
four-lane version is in the vector pipeline and allows for a maximum (depending on the DRAM technology), whereas if data reside in
frequency of around 1 GHz, whereas the scalar pipeline has a an L2 (with W X 64 b) OI becomes 0.25 FLOP/B and a dedi-
critical path allowing up to 1.7 GHz [5]. Therefore, applying cated L1V with W C;V 356 b reduces OI to 0.04 FLOP/B.
state-of-the-art techniques to improve fault tolerance only to the
Furthermore, from Fig. 8 it can be deduced that keeping a processor
scalar pipeline, such as triple modular redundancy (TMR) at flip-
in a compute-bounded state for a given OI puts increasingly higher
flop level in the scalar pipeline and error detection and correction
requirements on the memory bandwidth when MTPCC (hence the
(EDAC) codes in the scalar register file, will not cause any penalties
computational capabilities) is increased (e.g., an implementation
in terms of maximum frequency and hence in terms of MTPCC .
with lower MTPCC has a lower OI ). As a result, extremely high-
As a matter of fact, TMR and EDAC are reported to cause only
performance processors for DNNs are actually memory-bounded
9% decrease in frequency in the LEON2 [71]. A similar decrease
except for very high OI [50].
would keep the maximum frequency of Ariane from 1.7 GHz [5] to
around 1.5 GHz, which is still above the maximum frequency
possible in the vector pipeline. A. Main Memory
The need for (at least) radiation-tolerant parts with solid flight
heritage limits the use of state-of-the-art memories. As a result, main
V. Memory Hierarchy memories for space in ESA missions lag behind commercial counter-
Figure 7 shows a possible memory hierarchy for a vector proces- parts in terms of performance. For instance, state-of-the-art OBCs
sor. As a typical memory hierarchy for scalar processors, it comprises typically employ single data rate (SDR) DRAM [72]. The SDR
an L1 cache for scalar data (L1D), a L1 instruction cache (L1I), a DRAM tested in [25] (ISSI IS42S86400B-7TL) has 16 bits for data
unified level 2 cache (L2),***** and a main memory. However, an L1V I/O and achieves up to 166 MHz. Therefore, its BD is 2.66 Gbps, i.e.,
*****This is typically the case of multicore processors (not shown in the two orders of magnitude less compared with the DDR3 DRAM
figure), where more cores with their own L1 caches are connected to the L2 via memories used in [33]. Faster DRAMs are also being considered,
an interconnect. as the DDR2 tested in [25] (IS43DR81280B-25DBLI), which has
DI MASCIO ET AL. 563
8 bits for I/O data and achieves up to 400 MHz. This means a BD of fCPU ∕fMC 2, we estimate 200 CCs of additional latency seen
6.4 Gbps, which is still more than one order of magnitude lower by the processor during reads due to the use of RS. This is a
compared with the DDR DRAM in [33]. significant increase (e.g., read latency of the DRAM chip around
20 ns [26], i.e., 15–20 CCs for fCPU 1 GHz), and therefore it may
1. EDAC Codes be required to lower the level of information redundancy or not
In the space environment, DRAMs suffer from single event upsets applying EDAC altogether on vector data to achieve the required
(SEUs) and multiple bit upsets (MBUs) as SRAMs [73]. However, in level of performance.
DRAMs most of the upsets happen in weakened cells [74]. Further-
more, compared with SRAMs, DRAMs are also more likely to suffer 2. Vulnerability of DNN Parameters
from stuck bits (cells stuck to a value, mostly related to variable bit To evaluate the effect of not applying EDAC on the DRAM when
retention [75]) and single event functional interrupts (SEFIs). The running a DNN, we estimate the effect of upsets on the parameters
effect of SEFIs in a DRAM ranges from some tens of bits to a full chip residing in the DRAM for CloudNet.
wrong per read cycle and can be recovered only with a chip reset or According to [74], a 512 Mb SDR DRAM memory
sometimes with a full power cycle [74]. To detect and correct these (MMSD08512408S-Y) experiences 2.75e–11 upset/bit/day in
errors, EDAC codes are employed in the DRAM. Including EDAC LEO. Therefore, 0.19 upsets/day are to be expected for coefficients
checkbits in DRAMs decreases the bandwidth, as also checkbits are and feature maps residing in the DRAM (using the peak memory
read and written, and increases latency, as the checkbits have to be reported in Sec. III). To assess the sensitivity to SEUs, we ran a fault
calculated before storing the data in memory and checked before injection campaign on the DNN coefficients expressed in SP floating
using the data read from the memory. For DRAMs in space embedded
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
3. Proposed Solutions for DRAMs workloads and finds that caches can significantly improve the perfor-
While the effect of SEUs on parameters can be tolerated by the mance of a vector processor. Furthermore, in [84] it is shown that
intrinsic robustness of DNNs, SEFIs produce an unpredictable number the use of caches helps masking memory latency, as increasing by
of errors per CC and therefore require mitigation. According to data 3.21× the latency of a memory access (from 14 CCs to 45 CCs)
from [74], a 512 Mb SDR DRAM memory (MMSD08512408S-Y) roughly triplicates the mean delay per memory reference for a proc-
experiences 1.33e−3 SEFI/device/day. To achieve the peak memory essor with uncached vector data and less than doubles the access time
required, 14 chips are required and therefore not including any EDAC for a processor with an L1 cache for vector data.
will produce a failure due to SEFIs every 53.7 days. This is unaccept- The following subsections will carry out a design exploration of
able, as every inference after the SEFI is likely to have insufficient QoS the L1V to assess which sizes, organizations, and write policies are
until the next reset of the failing chip. As a mitigation, DRAM chips more efficient for vector processors.
can be reset periodically. Assuming a reset every 2 h, the percentage of
failed inferences due to SEFIs WI SEFI in the worst case is 1. Size
From Table 1, it is clear that the large matrices originating from
FailuresSEFI ΔT rst unrolling of convolutional layers (ranging from 3 to 41 MiB) do not
WISEFI 0.16% (10)
Total inferences MTTFSEFI fit even in large L2 caches (e.g., 2 MiB [51]). This problem can be
addressed with tiling, as shown in Fig. 9. In this approach, two levels
The contribution to wrong inferences of SEUs can be estimated with of looping (shown in Fig. 9 with index i and j) select a subset of the
a similar equation, where the MTTFSEU in the denominator is divided matrix–matrix multiplication that produces one of the
by 0.03 to account for the discussion in Sec. V.A.2 on the vulnerable
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
bits of floating point coefficients and T rst is replaced with the time UV N
required for a single inference T inf . The value found is negligible b b
(two orders of magnitude less than the contribution of SEFIs). How-
ever, the final value of average reliability Ravg 1 − WI SEU − WISEFI tiles of the result, each composed of b × b elements. By increasing
(99.84%) can be not deemed enough for critical applications. The the size of the cache, it is possible to work on larger matrix blocks
availability instead depends also on the maintenance time after a reset. residing in the L1V. The subset of operations obtained in Fig 9b can
If we assume a maintenance time of 30 s for each reset, we find that the be decomposed into
availability of the service is 99.58%, whereas a maintenance time of
300 s produces an availability of 95.83%. Both values are below typical CJK
requirements of dependable systems (e.g., [82]). b
A tradeoff between RS and no EDAC is represented by simpler
EDAC codes. EDAC codes with lower redundancy, although they segments, and the results of these segments can be accumulated to
cannot mask SEFIs, can still detect some of the wrong bits caused by generate the final result of the tile. The level (c) in Fig. 9 is where the
the SEFI. For instance, a parity bit per chip can detect an odd number mapping to SGEMM (described in Sec. III.A.1) can be applied.
of errors in a chip, and it is possible to keep track of them with a One of the possible implementations of SGEMM (Fig. 9d) is a loop
counter. When the number of errors from a chip exceeds a certain selecting the mth column of A0 and the mth row of A1 and generating
threshold in a certain time window, the DRAM chip is reset to recover a matrix where the pth column is the mth column of A0 multiplied by
from a probable SEFI. Assuming a threshold of three errors and an A1mp . Vectorization is applied with a maximum vector length of V L ,
equal probability that the SEFI will cause an even or odd number of with FMA (accumulate) operations between the vector A0m and a
errors, the percentage of wrong inferences due to SEFIs is scalar A1mp. A matrix representation of this implementation for a
2N thr 1 2 × 2 example is shown below.†††††
WISEFI 0.0009% (11) 0 1
MTTFSEFI ∕ΔT inf
A011 A111 A012 A121 A011 A112 A012 A121
Regarding SEUs, neglecting accumulation and MBUs, all the A2 @ A
upsets are detected. Therefore Rav 99.9991%, which is a substan- A021 A111 A022 A121 A021 A112 A012 A122
tial increment compared with employing no EDAC. There is a 0 1 0 1
substantial increment in availability too, with 99.9994 and j j j j
B C B C
99.994%, respectively, for 30 and 300 s of unavailability per reset. B C B C
B A01 A111 A01 A112 C B A02 A121 A02 A122 C
Table 3 summarizes the different EDAC and reset approaches @ A @ A
discussed to protect DRAMs for DNNs. j j j j
a)
b)
Downloaded by Indian Institute of Science on December 29, 2023 | http://arc.aiaa.org | DOI: 10.2514/1.I010916
c)
d)
Fig. 9 Example of tiling of a matrix–matrix multiplication. “Acc.” stands for accumulation.
row vector of length b from main memory to L1V is UV CJK N
T L;CJK×UV T L;CJK×b T b×b 0
b b b
b SE b
T L;b T L;V
VL BM with b 0 UVmodb.
Similar equations can be derived for storing the result, substituting
and the time required to read an entire b × b tile is T L;b×b T L;b b. the subscript L with S. Only the final result for each tile is written to
The time required to read a fringe b × b 0 tile with b 0 < b is instead main memory; therefore the time to store all the results is
0
b SE b 0 UV N
T L;b×b 0 T L;V b T S;N×UV T S;N×b T L;b×b 0
VL BM b b
There are three possible implementations, depending on which tile with b 0 UVmodb.
(of the coefficient, input feature, and output feature matrix) is kept Considering the associate continuous functions (without modulo,
into the L1V during the innermost looping. Assuming that the output ceiling, and floor functions), it is possible to prove that the fastest
feature matrix is kept in L1V, the time required during the loop on implementation is the one keeping in L1V the tile of the output
CJK × b to load all the tiles in a CJK × b stripe of the CJK × UV feature matrix. This is because this implementation does not require
input feature matrix (as shown in Fig. 9b) is loading and storing of the temporary tile of the output matrix during
accumulation.
CJK To trade off the speed-up against the increase in size due to a larger
T L;CJK×b T L;b×b L1V, we consider the area efficiency in terms of FLOP/CC/GE for
b
matrix multiplications with matrices residing in L1V. To give a
whereas for a b × CJK stripe of the N × CJK matrix realistic estimation of the cache size that maximizes the area effi-
ciency, we consider what the effect of adding an L1V to Ara would
be in terms of area. The area of Ariane and Ara ranges from 2228
CJK
T L;b×CJK T L;b×b T b×b 0 for two lanes to 10,735 kGE for 16 lanes. As a worst case for
b memory-bounded conditions, we assume 16 lanes (V L 16), and
in this case the area without L1V is 10,735 GE. The area of the L1
where b 0 CJKmodb. As every column has to be multiplied for cache is estimated as AL1V;GE 6∕4N b , assuming 6T SRAM cells
every row, the total time spent reading the coefficient matrix is and a GE corresponding to four transistors.
We will consider four cases comprising all the combinations of
UV N memory with latency 50 CCs (representative of the latency without
T L;N×CJK T L;b×CJK
b b RS) and 300 CCs (representative of the latency with RS) and with
bandwidths of 4 and 40 b/CC (respectively, representative of a
where the ceiling is required because all the matrix of the coefficient memory module with 4 SDR chips and 4 DDR chips). Table 4 shows
is to be read again even if only one column of the input feature is left to the results of this model. The main observations are that the optimal
be loaded. Similarly, the total time spent reading the CJK × UV size of L1V is much larger (256 KiB-1 MiB) than a typical L1D
matrix is instead (e.g., 16 KiB [51]) and that the most impacting factor on the area
566 DI MASCIO ET AL.
Table 4  Estimates of area Atot [MGE] and area efficiency AE [FLOP/CC/MGE] for a 16-lane vector processor with different sizes of L1V, main memory (latency and bandwidth), and maximum size of the tile b × b when applying tiling to the layers of CloudNet

Characteristic   64 KiB    128 KiB   256 KiB   512 KiB   1 MiB     2 MiB
b                40        60        84        120       168       240
Atot             11.5      12.3      13.9      17.0      23.3      35.9
Layer 1: C = 4, N = 16, J = K = 3, U = V = 192
AE_{50,40}       1.06E+0   1.44E+0   1.91E+0   2.35E+0   2.44E+0   2.34E+0
AE_{50,4}        4.09E−1   5.44E−1   7.23E−1   8.59E−1   8.82E−1   8.28E−1
AE_{300,4}       1.57E−1   2.14E−1   2.84E−1   3.48E−1   3.59E−1   3.44E−1
AE_{300,40}      2.06E−1   2.83E−1   3.76E−1   4.68E−1   4.86E−1   4.71E−1
Layer 11: C = 128, N = 256, J = K = 3, U = V = 24
AE_{50,40}       4.04E−1   1.58E+0   2.03E+0   2.40E+0   2.33E+0   2.08E+0
AE_{50,4}        2.62E−1   5.97E−1   7.65E−1   8.72E−1   8.43E−1   7.33E−1
AE_{300,4}       6.48E−2   2.35E−1   3.01E−1   3.54E−1   3.43E−1   3.05E−1
AE_{300,40}      7.09E−2   3.11E−1   3.99E−1   4.78E−1   4.63E−1   4.17E−1
Layer 19: C = 512, N = 1024, J = K = 3, U = V = 6
The main observations are that the optimal size of L1V is much larger (256 KiB–1 MiB) than a typical L1D (e.g., 16 KiB [51]) and that the factor with the largest impact on the area efficiency is the dimensions of the convolution. For each layer, one cache size maximizes the area efficiency independently of latency and bandwidth. This value decreases from 1 MiB to 256 KiB when going from layers with large UV and small C and N to layers with small UV and large C and N. This means that processors intended to run deeper CNNs can employ smaller caches with a lower penalty. However, the maximum area efficiency decreases going from layer 1 to layer 11 to layer 19.
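The Atot row of Table 4 can be reproduced from the area model quoted above (a 16-lane baseline of 10,735 kGE plus 6 ∕ 4 GE per SRAM bit); the short sketch below, with assumed helper names, shows the arithmetic.

# Sketch of the area estimate behind the Atot row of Table 4: a 16-lane
# Ariane + Ara baseline of 10,735 kGE plus an L1V modeled as 6T SRAM cells
# (6 transistors per bit) with one gate equivalent (GE) = 4 transistors.
BASE_KGE = 10735  # 16-lane baseline area without L1V, from the text

def l1v_area_kge(capacity_kib):
    bits = capacity_kib * 1024 * 8
    return (6 / 4) * bits / 1000  # kGE

for cap in (64, 128, 256, 512, 1024, 2048):
    atot_mge = (BASE_KGE + l1v_area_kge(cap)) / 1000
    print(f"{cap:5d} KiB -> Atot ~ {atot_mge:.1f} MGE")
# Prints ~11.5, 12.3, 13.9, 17.0, 23.3, and 35.9 MGE, matching the Atot row.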
2. Organization
The model in the previous section assumes that it is possible to keep the tiles in the L1V, i.e., that loading a vector belonging to one of the tiles does not cause the eviction of data belonging to one of the other tiles still required. Whether this happens or not depends on the cache organization, and an ineffective organization requires larger caches to allow the tiles to reside in the cache during the computation.
Data-parallel ISA extensions (including the RVVE [64]) typically support vector load and store operations with nonunit stride V_S; i.e., two contiguous elements of the vector are placed in noncontiguous locations separated by V_S − 1 elements. According to the model in [84], the fraction of nonunit strides in a workload determines whether organizations similar to those of scalar processors are enough to achieve acceptable performance or organizations specific to vector processors are required. One example of the latter is prime-mapped caches [84], which have a conflict-free memory organization for vectors with power-of-two strides. However, they have no advantage over direct-mapped caches (the simplest cache organization for scalar processors) when all the strides are unitary. In [53] the breakdown of vector memory accesses for 20 benchmarks running on three different vector machines (Cray90, Alliant FX/8, Convex C3) is reported. The respective percentages are 66.37% unit stride, 24.24% other strides, and 9.40% indexed (also known as "scatter and gather" and also supported by the RVVE [64]). The improvement with prime-mapped caches for a typical workload with 70% unit stride is 2× over the cacheless version, whereas the improvement for direct-mapped caches is below 1.5× [84].
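A minimal sketch of the underlying indexing effect (with hypothetical cache parameters) is shown below: with a power-of-two number of sets, power-of-two strides map to only a few sets and conflict, whereas a prime number of sets spreads the same accesses over almost all sets.

# Why power-of-two strides alias in a direct-mapped cache but spread out when
# the number of sets is prime (the idea behind prime-mapped caches [84]).
LINE_BYTES = 32

def set_index(addr, n_sets):
    return (addr // LINE_BYTES) % n_sets

def touched_sets(stride_elems, n_sets, elem_bytes=8, n_accesses=1024):
    addrs = [i * stride_elems * elem_bytes for i in range(n_accesses)]
    return len({set_index(a, n_sets) for a in addrs})

# 256 sets (power of two) vs 251 sets (prime), stride of 64 elements:
print(touched_sets(64, 256))  # only 16 distinct sets -> conflict misses
print(touched_sets(64, 251))  # all 251 sets used -> conflicts avoided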
Typical applications that require nonunit strides are the fast Fourier transform (FFT) and its inverse (IFT) [84]. FFT is employed in several compute-intensive workloads. For instance, in [86] it is proposed to speed up CNN execution, as convolutions can be substituted by a sequence of FFT, elementwise multiplication, and IFT.
To investigate whether vector loads and stores with nonunit strides are present in DNNs, we translated CloudNet into ARM NEON assembly (which supports vector load and store strides of size 1, 2, 3, 4, and 8) using TVM.§§§§§ The fractions of vector accesses with stride 1, stride 2, and stride 4 are, respectively, 97.13, 1.62, and 1.25%. No accesses with stride 3 (supported in NEON) have been found. Translating other DNNs leads instead to only unit stride accesses. For instance, translating the popular resnet18_v1 [41] model did not produce nonunit stride accesses.
§§§§§ https://github.com/apache/incubator-tvm.
These findings suggest that, although in a first phase this problem could be mitigated by relying on certain choices of DNN architectures and software implementations to reduce the fraction of nonunit vector strides, in general different cache organizations are needed compared with those typically employed for scalar processors.

3. Write Policy
A microarchitecture with separate scalar and vector data caches requires a solution to handle memory coherence issues when data in one of the two is modified and an old value is read from the other. This can be addressed with a write-through policy for L1V and L1D, although this comes with substantial penalties, especially in terms of power [87], memory traffic [88], and performance [89].
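The hazard can be illustrated with a toy model (the dictionaries and function names below are illustrative, not the proposed microarchitecture): a scalar store that stays in a write-back L1D is invisible to a later vector load that hits in the L1V, whereas a write-through store at least keeps main memory current.

main_mem = {0x100: 1.0}
l1d, l1v = {}, {}

def vector_load(addr):
    # Vector side: refill L1V from main memory on a miss, then hit in L1V.
    if addr not in l1v:
        l1v[addr] = main_mem[addr]
    return l1v[addr]

def scalar_store(addr, value, write_through=False):
    # Scalar side: write-back keeps the new value only in L1D;
    # write-through also updates main memory on every store.
    l1d[addr] = value
    if write_through:
        main_mem[addr] = value

print(vector_load(0x100))      # 1.0 -> the line is now cached in L1V
scalar_store(0x100, 2.0)       # write-back: main memory still holds 1.0
print(vector_load(0x100))      # still 1.0 -> stale data read from L1V
scalar_store(0x100, 3.0, write_through=True)
print(main_mem[0x100])         # 3.0 -> memory is current, but the stale L1V
                               # copy must still be invalidated or refilled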
VI. Conclusions
The recent shift of focus of the space industry from large GEO to small LEO satellites opens up new challenges. Limited downlink data rates and short communication windows typically allow the transmission of just a fraction of the data generated by on-board sensors in small LEO satellites. The efficiency of the downlink can be increased with data compression and with data removal (e.g., removing images that have a certain percentage of pixels covered by clouds). This solution requires a dedicated processor that comes at a relatively high cost in terms of power (around 5 W), which can be sustained only by relatively large satellites. Furthermore, long periods without contact with the base station require an on-board virtual operator, monitoring the status of the satellite and making decisions when communication with the ground station is not possible.
These challenges in terms of downlink efficiency and dependability can be addressed with DNNs when it is possible to build relatively large datasets (e.g., thousands of images or months of telemetry). Therefore, there is a need for large, public, and standardized datasets to be used as challenges for DNN architectures to
be deployed in space applications. However, part of the future LEO satellites is planned to be launched in large constellations, making large datasets more easily available in the future.
The analysis of the workloads associated with DNNs shows that most parts are very compute-intensive and can be mapped to matrix–matrix multiplications, for which DLP is the most efficient microarchitectural solution to increase execution speed. Among the data-parallel ISA extensions available, the RVVE is gaining momentum because of its openness and efficiency. Although there are already processors based on the RVVE, the software ecosystem of the RVVE is in an early stage, as the ISA specifications are not frozen yet. Therefore, during the early development of a RISC-V vector processor, some adjustments may be required. This is a risk that can be accepted given the long development times of space processors.
The analysis of the microarchitecture of a vector processor shows possible criticalities both for the computational capabilities and for the memory hierarchy. For instance, the scalability with the number of lanes can be an issue, especially for operations involving all of them. The width of the bus interface has also been found to be a possible bottleneck, and the use of an L1V has been suggested as a possible mitigation approach. L1 caches for vector data maximize the area efficiency when executing convolutional layers if their size is around 256 KiB–1 MiB. Furthermore, the microarchitecture of the scalar pipeline affects the performance for small OI, given the limited issue rate of microarchitectures with low ILP. Moreover, it is possible to apply different redundancy approaches to the decoupled vector and scalar pipelines to reduce the resulting performance penalties.
The relatively large size and the focus on high performance of vector processors require the identification of a radiation-tolerant ASIC technology with a technology node around 28 nm (considering also the SER), whereas state-of-the-art processors in space systems are typically still based on RHBD 65 nm technologies. Furthermore, an ASIC technology with multiported SRAMs is required for an area-efficient implementation of the VRF.
Finally, this work investigated the performance and dependability characteristics of the main memory, which involves one of the most important tradeoffs in space embedded systems. Demanding applications (e.g., image classification) require a main memory with around 1 GiB capacity, which is more than the typical DRAM capacity required in many space missions. When availability is not a primary concern, EDAC codes for DRAMs with low redundancy and latency can be employed to detect SEFIs and restart DRAM chips in noncritical applications. In even less critical applications, periodic resets of DRAM chips can be deemed sufficient. For critical applications, RS is still required. Therefore, some performance-demanding applications requiring high availability (e.g., online processing) may be unfeasible.

Acknowledgments
This work was supported by the European Space Agency under the NPI Program, Cobham Gaisler AB, and Delft University of Technology.

References
[1] Lemley, J., Bazrafkan, S., and Corcoran, P., "Deep Learning for Consumer Devices and Services: Pushing the Limits for Machine Learning, Artificial Intelligence, and Computer Vision," IEEE Consumer Electronics Magazine, Vol. 6, No. 2, 2017, pp. 48–56. https://doi.org/10.1109/MCE.2016.2640698
[2] Schwank, J. R., Shaneyfelt, M. R., and Dodd, P. E., "Radiation Hardness Assurance Testing of Microelectronic Devices and Integrated Circuits: Radiation Environments, Physical Mechanisms, and Foundations for Hardness Assurance," IEEE Transactions on Nuclear Science, Vol. 60, No. 3, 2013, pp. 2074–2100. https://doi.org/10.1109/TNS.2013.2254722
[3] Wyrwas, E., "Proton Testing of AMD e9173 GPU," 2019, https://nepp.nasa.gov/files/30362/NEPP-TR-2019-Wyrwas-TR-19-022_AMD-e9173-GPU-2019 June02-TN72682.pdf.
[4] Di Mascio, S., Menicucci, A., Gill, E., Furano, G., and Monteleone, C., "Leveraging the Openness and Modularity of RISC-V in Space," Journal of Aerospace Information Systems, Vol. 16, No. 11, 2019, pp. 454–472. https://doi.org/10.2514/1.I010735
[5] Zaruba, F., and Benini, L., "The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-Ready 1.7-GHz 64-Bit RISC-V Core in 22-nm FDSOI Technology," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 27, No. 11, 2019, pp. 2629–2640. https://doi.org/10.1109/TVLSI.2019.2926114
[6] Li, X., Adve, S. V., Bose, P., and Rivers, J. A., "Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions," 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07), IEEE Publ., Piscataway, NJ, 2007, pp. 266–275. https://doi.org/10.1109/DSN.2007.15
[7] Blacker, P., Bridges, C. P., and Hadfield, S., "Rapid Prototyping of Deep Learning Models on Radiation Hardened CPUs," 2019 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), IEEE Publ., Piscataway, NJ, 2019, pp. 25–32. https://doi.org/10.1109/AHS.2019.000-4
[8] Lai, L., and Suda, N., "Enabling Deep Learning at the IoT Edge," 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), IEEE Publ., Piscataway, NJ, 2018, pp. 1–6. https://doi.org/10.1145/3240765.3243473
[9] Furano, G., Meoni, G., Dunne, A., Moloney, D., Ferlet-Cavrois, V., Tavoularis, A., Byrne, J., Buckley, L., Psarakis, M., Voss, K.-O., and Fanucci, L., "Towards the Use of Artificial Intelligence on the Edge in Space Systems: Challenges and Opportunities," IEEE Aerospace and Electronic Systems Magazine, Vol. 35, No. 12, 2020, pp. 44–56. https://doi.org/10.1109/MAES.2020.3008468
[10] Lentaris, G., Maragos, K., Stratakos, I., Papadopoulos, L., Papanikolaou, O., Soudris, D., Lourakis, M., Zabulis, X., Gonzalez-Arjona, D., and Furano, G., "High-Performance Embedded Computing in Space: Evaluation of Platforms for Vision-Based Navigation," Journal of Aerospace Information Systems, Vol. 15, No. 4, 2018, pp. 178–192. https://doi.org/10.2514/1.I010555
[11] Pignol, M., "COTS-Based Applications in Space Avionics," 2010 Design, Automation Test in Europe Conference Exhibition (DATE 2010), IEEE Publ., Piscataway, NJ, 2010, pp. 1213–1219. https://doi.org/10.1109/DATE.2010.5456992
[12] Del Sozzo, E., Solazzo, A., Miele, A., and Santambrogio, M. D., "On the Automation of High Level Synthesis of Convolutional Neural Networks," 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE Publ., Piscataway, NJ, 2016, pp. 217–224. https://doi.org/10.1109/IPDPSW.2016.153
[13] Xi, S. L., Yao, Y., Bhardwaj, K., Whatmough, P., Wei, G.-Y., and Brooks, D., "SMAUG: End-to-End Full-Stack Simulation Infrastructure for Deep Learning Workloads," ACM Transactions on Architecture and Code Optimization, Vol. 17, No. 4, 2020, pp. 1–26. https://doi.org/10.1145/3424669
[14] Andersson, J., "Development of a NOEL-V RISC-V SoC Targeting Space Applications," 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE Computer Soc., Los Alamitos, CA, 2020, pp. 66–67. https://doi.org/10.1109/DSN-W50199.2020.00020
[15] "The RISC-V Instruction Set Manual Volume I: Unprivileged ISA, Document Version 20190608-Base-Ratified," RISC-V Foundation, 2019, https://content.riscv.org/wp-content/uploads/2019/06/riscv-spec.pdf.
[16] Henry, C., "Geostationary Satellite Orders Bouncing Back," 2020, https://spacenews.com/geostationary-satellite-orders-bouncing-back/.
[17] Lal, B., Sylak-Glassman, E., Mineiro, M., Gupta, N., Pratt, L., and Azari, A., "Global Trends in Space Volume 2: Trends by Subsector and Factors that Could Disrupt Them," Vol. 2, Inst. for Defense Analyses, Science & Technology Policy Inst., IDA Paper P-5242, 2015, https://www.ida.org/-/media/feature/publications/g/gl/global-trends-in-space-volume-2-trends-by-subsector-and-factors-that-could-disrupt-them/p5242v2.ashx.
[18] Maral, G., Bousquet, M., and Sun, Z., Satellite Communications Systems: Systems, Techniques and Technology, Wiley, Hoboken, NJ, 2020, Chap. 1.
[19] Radtke, J., Kebschull, C., and Stoll, E., "Interactions of the Space Debris Environment with Mega Constellations—Using the Example of the OneWeb Constellation," Acta Astronautica, Vol. 131, Feb. 2017, pp. 55–68. https://doi.org/10.1016/j.actaastro.2016.11.021
[20] Selva, D., and Krejci, D., "A Survey and Assessment of the Capabilities of Cubesats for Earth Observation," Acta Astronautica, Vol. 74, May
DDR2 and DDR3 Memories," 2016 IEEE Radiation Effects Data Workshop (REDW), IEEE Publ., Piscataway, NJ, 2016, pp. 1–7. https://doi.org/10.1109/NSREC.2016.7891742
[26] "IS43/46DR81280B(L), IS43/46DR16640B(L) Datasheet," Integrated Silicon Solution, Inc. (ISSI), 2015, http://www.issi.com/WW/pdf/43-46DR81280B-16640B.pdf.
[27] Cavalcante, M., Schuiki, F., Zaruba, F., Schaffner, M., and Benini, L., "Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multiprecision Floating-Point Support in 22-nm FD-SOI," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 28, No. 2, 2020, pp. 530–543. https://doi.org/10.1109/TVLSI.2019.2950087
[28] Cappellone, D., Di Mascio, S., Furano, G., and Ottavi, A. M. M., "On-Board Satellite Telemetry Forecasting with RNN on RISC-V Based Multicore Processor," 2020 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), IEEE Publ., Piscataway, NJ, 2020, pp. 1–6. https://doi.org/10.1109/DFT50435.2020.9250796
[29] Sze, V., Chen, Y.-H., Yang, T.-J., and Emer, J. S., "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, Vol. 105, No. 12, 2017, pp. 2295–2329. https://doi.org/10.1109/JPROC.2017.2761740
[30] Luo, C., Li, X., Wang, L., He, J., Li, D., and Zhou, J., "How Does the Data Set Affect CNN-based Image Classification Performance?" 2018 5th International Conference on Systems and Informatics (ICSAI), IEEE Publ., Piscataway, NJ, 2018, pp. 361–366. https://doi.org/10.1109/ICSAI.2018.8599448
[31] Phiri, D., and Morgenroth, J., "Developments in Landsat Land Cover Classification Methods: A Review," Remote Sensing, Vol. 9, No. 9, 2017, p. 967. https://doi.org/10.3390/rs9090967
[32] Williams, S., Waterman, A., and Patterson, D., "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Communications of the ACM, Vol. 52, No. 4, 2009, pp. 65–76. https://doi.org/10.1145/1498765.1498785
[33] Ilic, A., Pratas, F., and Sousa, L., "Cache-Aware Roofline Model: Upgrading the Loft," IEEE Computer Architecture Letters, Vol. 13, No. 1, 2014, pp. 21–24. https://doi.org/10.1109/L-CA.2013.6
[34] Mohajerani, S., and Saeedi, P., "Cloud-Net: An End-to-End Cloud Detection Algorithm for Landsat 8 Imagery," IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, IEEE Publ., Piscataway, NJ, 2019, pp. 1029–1032. https://doi.org/10.1109/IGARSS.2019.8898776
[35] Shelhamer, E., Long, J., and Darrell, T., "Fully Convolutional Networks for Semantic Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 39, No. 4, 2017, pp. 640–651. https://doi.org/10.1109/TPAMI.2016.2572683
[36] Bianco, S., Cadene, R., Celona, L., and Napoletano, P., "Benchmark Analysis of Representative Deep Neural Network Architectures," IEEE Access, Vol. 6, Oct. 2018, pp. 64,270–64,277. https://doi.org/10.1109/ACCESS.2018.2877890
[37] Abdelouahab, K., Pelcat, M., Sérot, J., and Berry, F., "Accelerating CNN inference on FPGAs: A Survey," 2018, http://arxiv.org/abs/1806.01683.
[38] Dumoulin, V., and Visin, F., "A Guide to Convolution Arithmetic for Deep Learning," arXiv preprint arXiv:1603.07285, 2016.
[39] Chellapilla, K., Puri, S., and Simard, P., "High Performance Convolutional Neural Networks for Document Processing," Tenth International
lation (ISMS), IEEE Publ., Piscataway, NJ, 2016, pp. 174–179. https://doi.org/10.1109/ISMS.2016.14
[44] Lai, L., Suda, N., and Chandra, V., "Cmsis-nn: Efficient Neural Network Kernels for Arm Cortex-m cpus," arXiv preprint arXiv:1801.06601, 2018.
[45] Lee, C.-Y., Gallagher, P., and Tu, Z., "Generalizing Pooling Functions in CNNs: Mixed, Gated, and Tree," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 40, No. 4, 2018, pp. 863–875. https://doi.org/10.1109/TPAMI.2017.2703082
[46] Cong, J., and Xiao, B., "Minimizing Computation in Convolutional Neural Networks," Artificial Neural Networks and Machine Learning—ICANN 2014, edited by S. Wermter, C. Weber, W. Duch, T. Honkela, P. Koprinkova-Hristova, S. Magg, G. Palm, and A. E. P. Villa, Springer International Publishing, Cham, Switzerland, 2014, pp. 281–290. https://doi.org/10.1007/978-3-319-11179-7_36
[47] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y., "Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation," Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Assoc. for Computational Linguistics, Stroudsburg, PA, 2014, pp. 1724–1734.
[48] Graves, A., "Supervised Sequence Labelling with Recurrent Neural Networks," Ph.D. Dissertation, Technical Univ. of Munich, Munich, 2008.
[49] Graves, A., Mohamed, A.-R., and Hinton, G., "Speech Recognition with Deep Recurrent Neural Networks," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Inst. of Electrical and Electronics Engineers, New York, 2013, pp. 6645–6649.
[50] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P.-L., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T. V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C. R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey, A., Jaworski, A., Kaplan, A., Khaitan, H., Killebrew, D., Koch, A., Kumar, N., Lacy, S., Laudon, J., Law, J., Le, D., Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G., Maggiore, A., Mahony, M., Miller, K., Nagarajan, R., Narayanaswami, R., Ni, R., Nix, K., Norrie, T., Omernick, M., Penukonda, N., Phelps, A., Ross, J., Ross, M., Salek, A., Samadiani, E., Severn, C., Sizikov, G., Snelham, M., Souter, J., Steinberg, D., Swing, A., Tan, M., Thorson, G., Tian, B., Toma, H., Tuttle, E., Vasudevan, V., Walter, R., Wang, W., Wilcox, E., and Yoon, D. H., "In-Datacenter Performance Analysis of a Tensor Processing Unit," SIGARCH Computer Architecture News, Vol. 45, No. 2, 2017, pp. 1–12. https://doi.org/10.1145/3140659.3080246
[51] Andersson, J., Hjorth, M., Johansson, F., and Habinc, S., "LEON Processor Devices for Space Missions: First 20 Years of LEON in Space," 2017 6th International Conference on Space Mission Challenges for Information Technology (SMC-IT), IEEE Publ., Piscataway, NJ, 2017, pp. 136–141. https://doi.org/10.1109/SMC-IT.2017.31
[52] Lopez, D., Llosa, J., Ayguade, E., and Valero, M., "Impact on Performance of Fused Multiply-Add Units in Aggressive VLIW Architectures," Proceedings of the 1999 International Conference on Parallel Processing, IEEE Publ., Piscataway, NJ, 1999, pp. 22–29.
[53] Asanovic, K., and Wawrzynek, J., Vector Microprocessors, Univ. of California, Berkeley, CA, 1998.
[54] Lee, S.-J., Park, S.-S., and Chung, K.-S., "Efficient SIMD Implementation for Accelerating Convolutional Neural Network," Proceedings of the 4th International Conference on Communication and Information Processing, Assoc. for Computing Machinery, New York, 2018, pp. 174–179. https://doi.org/10.1145/3290420.3290444
[55] Flamand, E., Rossi, D., Conti, F., Loi, I., Pullini, A., Rotenberg, F., and Benini, L., "GAP-8: A RISC-V SoC for AI at the Edge of the IoT," 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP), IEEE Publ., Piscataway, NJ, 2018, pp. 1–4. https://doi.org/10.1109/ASAP.2018.8445101
[56] Peleg, A., and Weiser, U., "MMX Technology Extension to the Intel Architecture," IEEE Micro, Vol. 16, No. 4, 1996, pp. 42–50. https://doi.org/10.1109/40.526924
[57] Thakkur, S., and Huff, T., "Internet Streaming SIMD Extensions," Computer, Vol. 32, No. 12, 1999, pp. 26–34. https://doi.org/10.1109/2.809248
[58] Doolan, D. C., Tabirca, S., and Yang, L. T., "Mobile Parallel Computing," 2006 Fifth International Symposium on Parallel and Distributed Computing, IEEE Publ., Piscataway, NJ, 2006, pp. 161–167. https://doi.org/10.1109/ISPDC.2006.33
[59] Gautschi, M., Schiavone, P. D., Traber, A., Loi, I., Pullini, A., Rossi, D., Flamand, E., Gürkaynak, F. K., and Benini, L., "Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 10, 2017, pp. 2700–2713. https://doi.org/10.1109/TVLSI.2017.2654506
[60] Dabbelt, D., Schmidt, C., Love, E., Mao, H., Karandikar, S., and Asanovic, K., "Vector Processors for Energy-Efficient Embedded Systems," Proceedings of the Third ACM International Workshop on Many-Core Embedded Systems, Assoc. for Computing Machinery, New York, 2016, pp. 10–16. https://doi.org/10.1145/2934495.2934497
[61] Stephens, N., Biles, S., Boettcher, M., Eapen, J., Eyole, M., Gabrielli, G., Horsnell, M., Magklis, G., Martinez, A., Premillieu, N., Reid, A., Rico, A., and Walker, P., "The ARM Scalable Vector Extension," IEEE Micro, Vol. 37, No. 2, 2017, pp. 26–39. https://doi.org/10.1109/MM.2017.35
[62] Shimizu, T., "Post-K Supercomputer with Fujitsu's Original CPU, A64FX Powered by Arm ISA," 2018, https://www.fujitsu.com/global/Images/post-k_supercomputer_with_fujitsu%27s_original_cpu_a64fx_powered_by_arm_isa.pdf.
[63] Lee, Y., Ou, A., Schmidt, C., Karandikar, S., Mao, H., and Asanovic, K., "The Hwacha Microarchitecture Manual, Version 3.8.1," Electrical Engineering and Computer Sciences Dept., Univ. of California TR UCB/EECS-2015-263, Berkeley, CA, 2015, https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-263.html.
[64] "RISC-V 'V' Vector Extension, Version 0.9," 2020, https://github.com/riscv/riscv-v-spec/releases/download/0.9/riscv-v-spec-0.9.pdf [retrieved 2 July 2020].
[65] Chen, C., Xiang, X., Liu, C., Shang, Y., Guo, R., Liu, D., Lu, Y., Hao, Z., Luo, J., Chen, Z., Li, C., Pu, Y., Meng, J., Yan, X., Xie, Y., and Qi, X., "Xuantie-910: A Commercial Multi-Core 12-Stage Pipeline Out-of-Order 64-Bit High Performance RISC-V Processor with Vector Extension: Industrial Product," 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), IEEE Publ., Piscataway, NJ, 2020, pp. 52–64. https://doi.org/10.1109/ISCA45697.2020.00016
[66] Louis, M. S., Azad, Z., Delshadtehrani, L., Gupta, S., Warden, P., Reddi, V. J., and Joshi, A., "Towards Deep Learning Using TensorFlow Lite on RISC-V," Third Workshop on Computer Architecture Research with RISC-V (CARRV), 2019, Paper 7, https://carrv.github.io/2019/papers/carrv2019_paper_7.pdf.
[67] Lee, Y., Waterman, A., Avizienis, R., Cook, H., Sun, C., Stojanović, V., and Asanović, K., "A 45 nm 1.3 GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators," ESSCIRC 2014—40th European Solid State Circuits Conference (ESSCIRC), IEEE Publ., Piscataway, NJ, 2014, pp. 199–202. https://doi.org/10.1109/ESSCIRC.2014.6942056
[68] Mukherjee, S. S., Weaver, C., Emer, J., Reinhardt, S. K., and Austin, T., "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor," Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), IEEE Publ., Piscataway, NJ, 2003, pp. 29–40. https://doi.org/10.1109/MICRO.2003.1253181
[69] Ebrahimi, M., Evans, A., Tahoori, M. B., Costenaro, E., Alexandrescu, D., Chandra, V., and Seyyedi, R., "Comprehensive Analysis of Sequential and Combinational Soft Errors in an Embedded Processor," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 34, No. 10, 2015, pp. 1586–1599. https://doi.org/10.1109/TCAD.2015.2422845
[70] Hubert, G., Artola, L., and Regis, D., "Impact of Scaling on the Soft Error Sensitivity of Bulk, FDSOI and FinFET Technologies due to Atmospheric Radiation," Integration, Vol. 50, June 2015, pp. 39–47. https://doi.org/10.1016/j.vlsi.2015.01.003
[71] Gaisler, J., "A Portable and Fault-Tolerant Microprocessor Based on the SPARC v8 Architecture," Proceedings International Conference on Dependable Systems and Networks, IEEE Publ., Piscataway, NJ, 2002, pp. 409–415. https://doi.org/10.1109/DSN.2002.1028926
[72] "OSCAR OBC," Airbus, 2018, https://www.airbus.com/content/dam/products-and-solutions/space/spacecraft-equipment/sce-datasheets/Publication-sce-oscar.pdf.
[73] Petit, S., David, J. P., Falguere, D., Duzellier, S., Inguimbert, C., Nuns, T., and Ecoffet, R., "Memories Response to MBU and Semi-Empirical Approach for SEE Rate Calculation," IEEE Transactions on Nuclear Science, Vol. 53, No. 4, 2006, pp. 1787–1793. https://doi.org/10.1109/TNS.2006.872153
[74] Samaras, A., Bezerra, F., Lorfevre, E., and Ecoffet, R., "CARMEN-2: In Flight Observation of Nondestructive Single Event Phenomena on Memories," 2011 12th European Conference on Radiation and Its Effects on Components and Systems, IEEE Publ., Piscataway, NJ, 2011, pp. 839–848. https://doi.org/10.1109/RADECS.2011.6131314
[75] Bacchini, A., Furano, G., Rovatti, M., and Ottavi, M., "Total Ionizing Dose Effects on DRAM Data Retention Time," IEEE Transactions on Nuclear Science, Vol. 61, No. 6, 2014, pp. 3690–3693. https://doi.org/10.1109/TNS.2014.2365532
[76] Kumar, A., and Sawitzki, S., "High-Throughput and Low-Power Architectures for Reed Solomon Decoder," Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, IEEE Publ., Piscataway, NJ, 2005, pp. 990–994. https://doi.org/10.1109/ACSSC.2005.1599906
[77] Udipi, A. N., Muralimanohar, N., Chatterjee, N., Balasubramonian, R., Davis, A., and Jouppi, N. P., "Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores," SIGARCH Computer Architecture News, Vol. 38, No. 3, 2010, pp. 175–186. https://doi.org/10.1145/1816038.1815983
[78] Hanho, L., "High-Speed VLSI Architecture for Parallel Reed-Solomon Decoder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 11, No. 2, 2003, pp. 288–294. https://doi.org/10.1109/TVLSI.2003.810782
[79] Shayan, Y. R., and Le-Ngoc, T., "A Cellular Structure for a Versatile Reed-Solomon Decoder," IEEE Transactions on Computers, Vol. 46, No. 1, 1997, pp. 80–85. https://doi.org/10.1109/12.559805
[80] Li, G., Hari, S. K. S., Sullivan, M., Tsai, T., Pattabiraman, K., Emer, J., and Keckler, S. W., "Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications," Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Assoc. for Computing Machinery, New York, 2017. https://doi.org/10.1145/3126908.3126964
[81] Zhang, Z., Huang, L., Huang, R., Xu, W., and Katz, D. S., "Quantifying the Impact of Memory Errors in Deep Learning," 2019 IEEE International Conference on Cluster Computing (CLUSTER), IEEE Publ., Piscataway, NJ, 2019, pp. 1–12. https://doi.org/10.1109/CLUSTER.2019.8890989
[82] Kosinski, B., and Dodson, K., "Key Attributes to Achieving >99.99 Satellite Availability," 2018 IEEE International Reliability Physics Symposium (IRPS), IEEE Publ., Piscataway, NJ, 2018, pp. 6A.3-1–6A.3-10. https://doi.org/10.1109/IRPS.2018.8353620
[83] Gee, J. D., and Smith, A. J., "Vector Processor Caches," Electrical Engineering and Computer Sciences Dept., Univ. of California, TR UCB/CSD-92-707, Berkeley, CA, Oct. 1992, http://www2.eecs.berkeley.edu/Pubs/TechRpts/1992/6251.html.
[84] Yang, Q., "Introducing a New Cache Design into Vector Computers," IEEE Transactions on Computers, Vol. 42, No. 12, 1993, pp. 1411–1424. https://doi.org/10.1109/12.260632
[85] "RISC-V 'V' Vector Extension, Version 0.8," 2019, https://github.com/riscv/riscv-v-spec/releases/download/0.8/riscv-v-spec-0.8.pdf [retrieved 5 Nov. 2020].
[86] Abtahi, T., Shea, C., Kulkarni, A., and Mohsenin, T., "Accelerating Convolutional Neural Network With FFT on Embedded Hardware," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 26, No. 9, 2018, pp. 1737–1749. https://doi.org/10.1109/TVLSI.2018.2825145
[87] Wang, S., Hu, J., and Ziavras, S. G., "On the Characterization of Data Cache Vulnerability in High-Performance Embedded Microprocessors," 2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, IEEE Publ., Piscataway, NJ, 2006, pp. 14–20. https://doi.org/10.1109/ICSAMOS.2006.300803
[88] Sadler, N. N., and Sorin, D. J., "Choosing an Error Protection Scheme for a Microprocessor's L1 Data Cache," 2006 International Conference on Computer Design, IEEE Publ., Piscataway, NJ, 2006, pp. 499–505. https://doi.org/10.1109/ICCD.2006.4380862
[89] Fernández, M., Gioiosa, R., Quiñones, E., Fossati, L., Zulianello, M., and Cazorla, F. J., "Assessing the Suitability of the NGMP Multi-Core Processor in the Space Domain," Proceedings of the Tenth ACM International Conference on Embedded Software, Assoc. for Computing Machinery, New York, 2012, pp. 175–184. https://doi.org/10.1145/2380356.2380389

Z. Sunberg
Associate Editor