Extended Abstract
Abstract—In this thesis, a novel computationally efficient multi-core architecture for solving the sparse matrix-vector multiply (SpMV) in FPGA is proposed. The efficient implementation of SpMV is challenging, as simple implementations of the kernels typically give a performance that is only a fraction of the peak. At the center of this problem is the fact that sparse operations are more bandwidth-bound than dense ones. Previous works on the subject suggest that the use of FPGAs for solving SpMV can improve performance levels when compared to the use of General Purpose Processors (GPPs), thereby improving total execution times when solving problems largely dependent on SpMV. As such, in this work the existing sparse matrix compression/storing formats are analyzed, their feasibility for an efficient implementation is verified, and lastly a multi-processor architecture is proposed, capable of better using the available bandwidth in order to achieve higher levels of performance. Through extensive experimentation on a wide dataset, the architecture exhibits the ability to outperform GPPs both in terms of peak and average performance. Given the target platform (Zynq), the performance of the developed architecture was compared to that of the ARM Cortex-A9 GPP present in the platform. The architecture has been shown to achieve an average of 78.5% of peak performance (624.23 MFLOPS), with a peak of 98.2% (785.23 MFLOPS). The ARM processor, on the other hand, attained an average of 141.53 MFLOPS, equivalent to 21.8% of its peak performance, while achieving a peak of 204.11 MFLOPS (31.4%) in one test. This translates into a performance improvement of 2.85–5.45×, averaging 4.48× the performance of the ARM processor tested.

Index Terms—Sparse Matrices, FPGA, Multi-Core Architecture

I. INTRODUCTION

Sparse linear algebra computations such as the matrix-vector product or the solution of sparse linear systems are often the bottleneck of many scientific fields, from computational fluid dynamics to structural engineering, electromagnetic analysis or even the study of economic models, to name a few. The task of computation regularly falls to CPUs and, due to the evolution in this field, performance is improving. However, obtaining peak performance from modern cache-based computational systems has proven to be extremely difficult. Several factors contribute to this low performance, such as the underlying machine architecture, memory access behavior, compiler technology and the nature of the input matrix, which may only be known at runtime. In sparse matrices, the fraction of nonzero elements is small when compared to the total number of elements. While it is possible to use generic data structures and routines to perform computations with such matrices, this is inefficient, as most calculations on zero elements are redundant, and sometimes even impractical, due to the large dimensions of these matrices. In practice, sparse matrices are stored using specialized data structures that only store the nonzero values, along with additional information regarding the position of each one in the matrix, so that the total size is proportional to the number of nonzero elements.

As an example of applicability in most of the fields named above, solving a partial differential equation using the finite elements method boils down to solving a system of linear equations of the form y = Ax, where y and x are vectors and A is a large matrix that is mostly composed of zero entries. The nonzero elements of A are arranged in a regular or irregular pattern depending on the selection of a structured or unstructured mesh for the discretization of the original problem [1, 2]. The efficient implementation of these operations is extremely important; however, it is also challenging, as simple implementations of the kernels typically give a performance that is only a fraction of peak [3, 4]. The center of this performance problem is that sparse operations are more bandwidth-bound than dense ones [5, 6]. Consequently, optimizations are needed, but these depend strongly on architectural variations, even between closely related versions of the same processor. To this end, a number of optimization techniques have been proposed, such as register and cache blocking [7, 8], and column or row reordering [9].

The main contribution of this thesis is a novel approach to the efficient computation of the Sparse Matrix-Vector Multiply (SpMV) problem. To this end, a reconfigurable multi-processor architecture has been researched and developed, capable of sustaining high peak performance by avoiding the common pitfalls that affect SpMV computation and therefore achieving better computational efficiency. This was only made possible by a thorough analysis of the storing formats and by efficiently exploiting the advantages provided by programmable logic, such as reduced costs for reading and writing to local memory, pipelining of the datapaths, etc. To complement the developed architecture and in order to further reduce the bandwidth required to perform SpMV, a novel storing format named RCSC is also proposed. Storing the sparse matrix according to this format further reduces the bandwidth required to perform SpMV without increasing the storage space needed, keeping in line with formats known to reduce size to a minimum. The use of this format removes the irregular memory accesses that limit SpMV performance, allowing for a streamlined approach to data transfer. The removal of irregular accesses is the most important performance improvement factor for the SpMV problem, where the possibility of data reuse is very limited.

The architecture is capable of receiving data via DMA [10] streaming and of partitioning the computation equally amongst the implemented PEs. These are capable of producing one result per clock cycle due to their pipelined FMA [11].
Although data retrieval from external memory and computation in the PEs are fully overlapped, the process of writing the output vector to external memory cannot be overlapped, due to the unknown sparsity pattern of the matrix.

The architecture was implemented on an evaluation and development board based on the Xilinx Zynq-7020 All Programmable (AP) System-on-Chip (SoC). The Processing System (PS) of the Zynq-7020 device can provide a bandwidth of 1600 MB/s, which allows for an architecture composed of 4 PEs to be implemented. Results-wise, an average of 624 MFLOPS and 78.5% efficiency is attained for a dataset of 14 sparse matrices and input vectors. When compared to the performance of the ARM Cortex-A9, present in the PS side of the Zynq, execution times are improved by 2.85–5.45×, while the efficiency of the ARM GPP is on average 21.8%, equivalent to 142 MFLOPS.

II. SPARSE MATRIX ALGORITHMS

A. Storing Formats

Enormous effort has been devoted to devising data storing formats with the aim of maximizing performance. This means that to fully optimize SpMV computation, we need to choose a compression algorithm that takes the structure of the sparse matrices into account. The focus of this section is to briefly describe the most common compression/storage formats available. The purpose of all the different format variations is either to improve the architecture's ability to access data and perform computations or to reduce the total space required to store the matrix. The most common format, regardless of the processor used, is CSR, which stores the nonzeros row-wise. Another favorite for implementations in programmable logic fabric is CSC, which stores the nonzeros column-wise. The use of other storing formats depends on the processor used, as ELLPACK and S-ELLPACK are most useful in GPU implementations due to their SIMD capabilities.

        [ 11   0   0  12 ]
    A = [ 61   0   0   0 ]
        [  0  72   0   0 ]
        [  0   0  33   0 ]

Fig. 1. Sparse matrix A

1) CSR: The Compressed Sparse Row (CSR) format stores an initial M × N sparse matrix A in row form using three one-dimensional arrays. Let nz denote the number of nonzero elements of A. The first array is called VAL and is of length nz. This array holds the values of all the nonzero entries of A in left-to-right then top-to-bottom (row-major) order. The second array, COL_IND, contains the column index (zero-based) of each element of VAL. The third and final array, ROW_PTR, is of length M + 1 (i.e. one entry per row, plus one). ROW_PTR(i) contains the index in VAL of the first nonzero element of row i. Row i of the original matrix A thus extends from VAL(ROW_PTR(i)) to VAL(ROW_PTR(i + 1) − 1), i.e. from the start of one row to the last index before the start of the next. The last entry in ROW_PTR (at zero-based index M) must be the number of elements in VAL (i.e. the number of nonzero elements of A). The name of this format reflects the fact that the row index information is compressed, which delivers increased efficiency in arithmetic operations and row slicing partitioning algorithms. A depiction of sparse matrix A of figure 1 compressed in CSR format is shown in figure 2.

VAL     = [11 12 61 72 33]
COL_IND = [0 3 0 1 2]
ROW_PTR = [0 2 3 4 5]

Fig. 2. Sparse matrix A in CSR format
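To make the CSR traversal concrete, the following C sketch computes y = Ax for a matrix stored in the three arrays described above. The array and variable names follow figure 2; this is an illustrative software kernel only, not the hardware datapath presented later.

#include <stdio.h>

/* y = A*x for an M-row sparse matrix A stored in CSR.
 * val, col_ind and row_ptr correspond to VAL, COL_IND and ROW_PTR of figure 2. */
static void spmv_csr(int M, const float *val, const int *col_ind,
                     const int *row_ptr, const float *x, float *y)
{
    for (int i = 0; i < M; i++) {
        float sum = 0.0f;
        /* Row i occupies positions row_ptr[i] .. row_ptr[i+1]-1 of val/col_ind. */
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_ind[j]];   /* irregular (gather) access to x */
        y[i] = sum;
    }
}

int main(void)
{
    /* Matrix A of figure 1 (M = 4, N = 4, nz = 5) in the arrays of figure 2. */
    float val[]     = {11, 12, 61, 72, 33};
    int   col_ind[] = {0, 3, 0, 1, 2};
    int   row_ptr[] = {0, 2, 3, 4, 5};
    float x[4] = {1, 2, 3, 4}, y[4];

    spmv_csr(4, val, col_ind, row_ptr, x, y);
    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}

Note how every nonzero forces a dependent read of x[col_ind[j]], the isolated accesses discussed in the comparison below.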
2) CSC: The Compressed Sparse Column (CSC) format is analogous to CSR, except that it follows a column-major order for the nonzero elements in the arrays. Instead of a column index, a row index is stored for each value, as are column pointers. This means the CSC format is composed of the vectors VAL, ROW_IND and COL_PTR, where VAL is an array of the (top-to-bottom then left-to-right) nonzero values of the sparse matrix; ROW_IND holds the row indexes corresponding to the values in VAL; and COL_PTR contains the index in VAL of the first nonzero element of column j. Due to its efficiency in arithmetic operations, column slicing, and matrix-vector products, this is the traditional format for specifying a sparse matrix in MATrix LABoratory (MATLAB). A depiction of sparse matrix A of figure 1 compressed in CSC format is shown in figure 3.

VAL     = [11 61 72 33 12]
ROW_IND = [0 1 2 3 0]
COL_PTR = [0 2 3 4 5]

Fig. 3. Sparse matrix A in CSC format

When comparing both of the previous formats, and assuming that all vectors related to the compression format and vector x are stored in memory, CSC requires fewer accesses to memory than CSR, since the values of vector x are reused in the computation of each nonzero element belonging to the same column. For the CSR format, more values have to be transferred from memory for each computation. First, two values of vector ROW_PTR need to be transferred; by computing ROW_PTR[i + 1] − ROW_PTR[i], the number of nonzero elements in the row is obtained. Then, for each (value, index) pair of nonzero elements transferred, the respective x[index] needs to be retrieved. This not only translates into a different value of vector x for each nonzero in every row, but also into delays in memory transfers and isolated accesses. The delay is caused by the necessary sequencing of the transfers of nonzero indexes and values, while the isolated transfers are caused by the values of x not being necessarily adjacent, so that burst reading or caching of values is not very useful.

These problems do not occur when dealing with a sparse matrix stored in CSC format. For every nonzero value and index transferred, the values of x needed are reused for each column analyzed. This leads to improvements when using cache and burst readings from memory, an important factor in SpMV.
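The same product written over the CSC arrays of figure 3 makes this access pattern explicit: x[j] is fetched once and reused for every nonzero of column j, while the accumulation scatters into y, which is precisely the read-after-write dependence the PEs must treat as a hazard. A minimal C sketch (again only illustrative, with names taken from figure 3):

/* y = A*x for an N-column sparse matrix A stored in CSC.
 * val, row_ind and col_ptr correspond to VAL, ROW_IND and COL_PTR of figure 3.
 * The caller must zero-initialise y. */
static void spmv_csc(int N, const float *val, const int *row_ind,
                     const int *col_ptr, const float *x, float *y)
{
    for (int j = 0; j < N; j++) {
        float xj = x[j];                  /* fetched once, reused for the whole column */
        for (int k = col_ptr[j]; k < col_ptr[j + 1]; k++)
            y[row_ind[k]] += val[k] * xj; /* scattered accumulation into y */
    }
}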
performed in the ARM processor. To verify the ability of the accelerator to process any sparse matrix, the chosen dataset (table II) included matrices whose parameters vary greatly: from square (N × N, e.g. bcsstm27) to rectangular matrices (M × N, e.g. Flower_5_4), and with diverse sparsity values, ranging from as low as 0.02% (e.g. OPF_6000) up to 23.08% (e.g. bibd_14_7). Within the matrices the number of hazards also varies greatly, from zero (e.g. bibd_14_7) to 47.46% (e.g. N_pid). Each of these parameters influences computational time, although only the accelerators are affected by RAW hazards, as the ARM processor follows a distinct execution path. Attention was also given to the existing patterns in the matrices and to the field of science each matrix belongs to, so as to include as many types as possible in the dataset. The accelerator achieves an average performance of 624 MFLOPS, for a computational efficiency of 78.03%. This can be seen in figure 10, where the performance of both the ARM processor and the architecture is represented in terms of achieved MFLOPS per matrix in the dataset. The respective peak performances are represented as well, clearly showing that the ARM processor achieves a performance level well below peak.
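As an aside, reproducing the hazard percentages above requires a precise definition of a hazard; the text treats them as RAW conflicts on the y accumulation inside the FMA pipeline. The C sketch below is therefore only a hypothetical illustration: it assumes (my assumption, not stated in this summary) that a hazard is counted whenever a nonzero targets a y row updated by one of the previous FMA_STAGES nonzeros of the stream, with the nonzeros taken in CSC order.

/* Hypothetical hazard counter for a CSC-ordered stream of nonzeros.
 * row_ind holds the row index of each of the nz nonzeros. */
#define FMA_STAGES 8   /* FMA pipeline depth stated in the text */

static double hazard_percentage(const int *row_ind, int nz)
{
    int hazards = 0;
    for (int i = 1; i < nz; i++) {
        int lookback = (i < FMA_STAGES) ? i : FMA_STAGES;
        for (int d = 1; d <= lookback; d++) {
            if (row_ind[i] == row_ind[i - d]) { hazards++; break; }
        }
    }
    return 100.0 * hazards / nz;
}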
B. Analytical Model

An analytical model was developed in order to predict the performance of the architecture with larger sparse matrices and when using more SpMV Accelerators. Given that the transfer of data from the external memory and the computation in the SpMV Accelerator are performed concurrently, execution time depends mostly on the communication performance. To develop the analytical model, each component that constitutes the implemented accelerators was modeled by determining the number of clock cycles required to process a nonzero element.

For the Packager unit, as one clock cycle is required per nonzero and structural element of the matrix, the total number of cycles required to process the entire sparse matrix is given by nz + n. As one instruction is sent to the PEs per nonzero element, the number of clock cycles required per nonzero element is given by equation 1:

    Packager [cycles/nz] = (nz + n) / nz                                            (1)

The number of cycles required by the PEs to perform the computation depends on the number and distribution of hazards within the matrix. As no pattern can be assumed for the nonzero elements within a sparse matrix, a simple Bernoulli distribution is assumed for the occurrence of hazards. As such, the number of cycles required by each PE to process the nonzero elements arriving via the Packager is given by equation 2:

    PE [cycles/nz] = Worst_stall × p_hazard + No_stall × (1 − p_hazard)             (2)

where p_hazard is the probability of a hazard in the sparse matrix (equal to zero when no hazards occur and equal to one when a hazard exists for every nonzero element), Worst_stall represents the highest number of clock cycles the PE needs to be stalled in order for the hazard to disperse, and No_stall gives the minimum number of clock cycles required for the nonzero element to enter the FMA unit when no hazard is detected. The value of Worst_stall is given by the number of pipeline stages of the FMA, which in this architecture is always equal to 8. No_stall is 2, as one clock cycle is required to read an address from the input FIFO and another for the corresponding y element to be at the output of the local memory.

As one SpMV Accelerator is composed of one Packager and two PEs, and given that the operations in the Packager and in the PEs are overlapped, the total number of clock cycles required to process a nonzero element is given by the component with the lowest throughput, the Packager or the two PEs, as shown by equation 3:

    Accelerator [cycles/nz] = max( Packager [cycles/nz] ; PE [cycles/nz] / 2 )      (3)

The estimate for the total execution time must also consider the transfer of all elements of vector y to external memory. For this, the internal DMA FIFOs, with a fixed length of 512 elements, need to be filled with the maximum burst length (256 elements) before the transfer is started. After the first burst of 256 elements, the process of filling the FIFO and transmitting the elements of vector y is performed simultaneously. With this latency accounted for, the execution time for an implementation of k accelerators can be estimated by equation 4, where m is the number of elements of vector y, f represents the frequency at which the SpMV Accelerator works in MHz, and the reduction circuit now contributes a larger portion of the execution time, as it adds a term that depends on the number of pipeline stages of the adder used (Adder_stages) times the number of reduction levels (⌈log2(k)⌉), as previously depicted in figure 8:

    Theo_exec_time [us] = ( (nz / k) × max( Packager [cycles/nz] ; PE [cycles/nz] / 2 )
                            + m + Adder_stages × ⌈log2(k)⌉ + 256 ) × (1 / f)        (4)

To validate the developed model, estimated execution times were calculated for several sparse matrices using equation 4 and compared to the values measured from execution in the target device. Table III shows both the experimental and the estimated execution times (in microseconds), as well as the error between measurement and estimation.
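As a cross-check, equations 1 to 4 can be transcribed directly into a small C routine. The sketch below mirrors the equations, with Worst_stall = 8 and No_stall = 2 as stated above; nz, n (structural elements), m (length of y), the hazard probability, the number of accelerators k, the adder pipeline depth and the clock frequency in MHz are supplied by the caller.

#include <math.h>

#define WORST_STALL 8.0   /* FMA pipeline stages */
#define NO_STALL    2.0   /* FIFO read + local-memory read */

/* Equations 1-4: estimated execution time in microseconds. */
static double theo_exec_time_us(double nz, double n, double m, double p_hazard,
                                int k, int adder_stages, double f_mhz)
{
    double packager = (nz + n) / nz;                                     /* eq. 1 */
    double pe = WORST_STALL * p_hazard + NO_STALL * (1.0 - p_hazard);    /* eq. 2 */
    double accel = fmax(packager, pe / 2.0);                             /* eq. 3 */
    double cycles = (nz / k) * accel
                    + m + adder_stages * ceil(log2((double)k)) + 256.0;  /* eq. 4 */
    return cycles / f_mhz;    /* f in MHz, so cycles/f is already in microseconds */
}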
C. Estimation for k accelerators

Assuming a target device with enough available bandwidth to feed a number k > 2 of accelerators, the analytical model can estimate the execution times for any sparse matrix. Execution times were estimated for all matrices that constitute the dataset, although, in order to avoid redundancy, matrices that produced similar results are not shown in table III. The number of accelerators was varied between three and seven and, for comparison's sake, results from implementations with one and two accelerators are also included. Despite the length of the reduction circuit increasing with the number of implemented accelerators (as given by the reduction term growing with k), the number of elements passing through this block is relatively small (n ≪ nz), so the time spent adding all partial results of vector y from each accelerator in the reduction circuit is a small fraction of the total computational time. As evidenced by table IV, speedup gains increase almost linearly with the number of implemented accelerators.
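Using the theo_exec_time_us sketch from the previous section, the same sweep over k can be reproduced in a few lines. The matrix parameters below are placeholders chosen only to illustrate the call, not values taken from the thesis dataset; the 100 MHz clock is the programmable-logic frequency mentioned later, while the adder depth is a placeholder.

#include <stdio.h>

int main(void)
{
    /* Hypothetical matrix: 100k nonzeros, 5k structural elements, y of length 5k. */
    double nz = 100000, n = 5000, m = 5000, p_hazard = 0.15;

    for (int k = 1; k <= 7; k++)
        printf("k = %d -> %.1f us\n", k,
               theo_exec_time_us(nz, n, m, p_hazard, k,
                                 /* adder_stages (placeholder) */ 8,
                                 /* f_mhz */ 100.0));
    return 0;
}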
TABLE II
EXECUTION TIME REQUIRED TO COMPUTE SPARSE MATRICES IN BOTH THE ARM AND ONE SpMV ACCELERATOR, AND RESPECTIVE OBTAINED SPEEDUP AND ARCHITECTURAL EFFICIENCY

Matrix           Rows    Columns   Sparsity (%)   Hazards (%)   ARM (us)   Accelerator (us)   Speedup   Efficiency (%)
MK9-B3            945      1260        0.32          14.23          139             38          3.66        75.5
MK10-B4           945      4725        0.11          25.48          268             61          4.42        56.3
Maragal_2         555       350        2.24           2.50          128             32          3.95        93.7
Flower_5_4       5226     14721        0.06           3.17         1644            350          4.70        77.1
bibd_14_7          91      3432       23.08           0.00         2055            405          4.98        95.5
LP_Pilot87       2030      6680        0.55           8.00         2222            489          4.55        81.1
bcsstm27         1224      1224        1.91           0.13          803            163          4.92        96.1
qc2534           2534      2534        3.63           0.17         6274           1215          5.16        98.2
big             13209     13209        0.05          28.26         3122            967          3.23        57.5
adder_dcop_64    1813      1813        0.33          21.56          365            106          3.45        64.8
p2p_gnutella08   6301      6301        0.05          20.45          532            233          2.29        68.2
N_pid            3625      3923        0.06          47.17          341            132          2.59        51.2
OPF_6000        29902     29902        0.02          44.19         5104           1885          2.71        47.7
SiNa             5743      5743        0.31           1.63         3021            599          5.05        95.0
Fig. 10. ARM and Hardware performance measurement in MFLOPS for each matrix in the dataset
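For reference, the MFLOPS, speedup and efficiency figures reported in table II and figure 10 can be derived from the measured execution times as sketched below, assuming the usual SpMV accounting of two floating-point operations (one multiply and one add) per nonzero element; the peak MFLOPS value used for the efficiency column is not stated explicitly here, so it is treated as a caller-supplied parameter.

/* time_us is the measured execution time in microseconds. */
static double mflops(double nz, double time_us)        { return 2.0 * nz / time_us; }
static double speedup(double arm_us, double acc_us)    { return arm_us / acc_us; }
static double efficiency_pct(double achieved_mflops, double peak_mflops)
{
    return 100.0 * achieved_mflops / peak_mflops;
}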
TABLE III
EXPERIMENTAL RESULTS VS. MODEL PREDICTIONS

Matrix            HW exec (us)   Theo exec (us)   Error (%)
MK9-B3                  38              39           2.82
MK10-B4                 61              59           2.11
Maragal_2               32              32           2.12
Flower_5_4             350             348           0.47
bibd_14_7              405             381           6.02
LP_Pilot87             489             488           0.25
bcsstm27               163             164           0.63
qc2534                1215            1205           0.80
big                    967             980           1.36
adder_dcop_64          106             110           3.69
p2p_gnutella08         233             233           0.19
N_pid                  132             136           3.45
OPF_6000              1885            1990           5.57
SiNa                   599             600           0.25

V. COMPARISON TO EXISTING SpMV ARCHITECTURES

In table V, the performance of the implemented architecture is compared to that of previous architectures. Naturally, the sparsity of the tested matrices influences the efficiency of the developed accelerator, as it does in the remaining works. Sparse matrix parameters that adversely affect the performance of the implemented accelerator include the number of rows, as it increases the size of the output vector that must be written to external memory; the ratio of matrix columns to nonzero elements, as it influences the performance of the Packager unit; and the number of hazards within the matrix, as each one requires the PE to stall the FMA core until the hazard is resolved.

All results were compared to a software-only execution in the ARM, optimized to take full advantage of the existing vector floating-point unit (128-bit SIMD) and operating at a frequency of 650 MHz, more than six times the frequency at which the programmable logic operates (100 MHz).
TABLE IV
HARDWARE AND SOFTWARE EXECUTION TIMES, IN MICROSECONDS (us), AND RESPECTIVE SPEEDUP OBTAINED FOR A SYSTEM WITH k SpMV ACCELERATORS USING COLUMN BLOCKS
TABLE V
COMPARISON OF FPGA SpMV ARCHITECTURES
VI. CONCLUSION

The results show that the architecture proposed in this work is able to achieve an average of 624 MFLOPS (single precision floating-point). This corresponds to a performance efficiency of 79%. These figures are better than those obtained by the ARM processor, where 142 MFLOPS were measured, corresponding to 22% of its peak performance. This translates into a performance improvement of 4.39× on average. Given the predictions of the analytical model, it is expected that the architecture is able to scale with the available bandwidth. Speedup values averaging 10.56× over the execution in a general purpose processor are to be expected, as long as enough bandwidth is available, along with the resources to implement the architecture in programmable logic.

REFERENCES

[1] C. Gfrerer (2012). Optimization Methods for Sparse Matrix-Vector Multiplication. In Seminar for Computer Science.
[2] Y. Saad (2003). Iterative Methods for Sparse Linear Systems. SIAM.
[3] R. W. Vuduc and H.-J. Moon (2005). Fast sparse matrix-vector multiplication by exploiting variable block structure. In Proceedings of the High-Performance Computing and Communications Conference (HPCC'2005), pages 807–816.
[4] A. Buttari, V. Eijkhout, J. Langou and S. Filippone (2005). Performance Optimization and Modeling of Blocked Sparse Kernels. Technical Report ICL-UT-04-05.
[5] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick and J. Demmel (2007). Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. In Supercomputing.
[6] J. D. Davis and E. S. Chung (2012). SpMV: A Memory-Bound Application on the GPU Stuck Between a Rock and a Hard Place. Microsoft Technical Report.
[7] E.-J. Im and K. Yelick (1999). Optimizing Sparse Matrix Vector Multiplication on SMPs. In Ninth SIAM Conference on Parallel Processing for Scientific Computing (SIAM'1999). SIAM.
[8] E.-J. Im and K. Yelick (1997). Optimizing sparse matrix computations for register reuse in SPARSITY. In Proceedings of the International Conference on Computational Sciences - Part I, pages 127–136.
[9] A. Pinar and M. T. Heath (1999). Improving performance of sparse matrix-vector multiplication. In Supercomputing.
[10] Wilson M. José, Ana Rita Silva, Mário P. Véstias and Horácio C. Neto (2014). Algorithm-oriented design of efficient many-core architectures applied to dense matrix multiplication. In Analog Integrated Circuits and Signal Processing, pages 9–16. IEEE.
[11] Mário P. Véstias (2014). DReaMaCHine - Design of a Reconfigurable Many-Core Architecture for High Perfor-