Extended Abstract
Abstract—In this thesis, a novel computationally efficient multi-core architecture for solving the sparse matrix-vector multiply (SpMV) in FPGA is proposed. The efficient implementation of SpMV is challenging, as simple implementations of the kernels typically give a performance that is only a fraction of the peak. At the center of this problem is the fact that sparse operations are more bandwidth-bound than dense ones. Previous works on the subject suggest that the use of FPGAs for solving SpMV can improve performance levels when compared to the use of General Purpose Processors (GPPs), thereby improving total execution times when solving problems largely dependent on SpMV. As such, in this work the existing sparse matrix compression/storing formats are analyzed, their feasibility for an efficient implementation is verified, and lastly a multi-processor architecture is proposed, capable of better using the available bandwidth in order to achieve higher levels of performance. Through extensive experimentation on a wide dataset, the architecture exhibits the ability to outperform GPPs both in terms of peak and average performance. Given the target platform (Zynq), the performance of the developed architecture was compared to that of the ARM Cortex-A9 GPP present in the platform. The architecture has been shown to achieve an average of 78.5% of peak performance (624.23 MFLOPS), with a peak of 98.2% (785.23 MFLOPS). The ARM processor, on the other hand, attained an average of 141.53 MFLOPS, equivalent to 21.8% of its peak performance, while achieving a peak of 204.11 MFLOPS (31.4%) in one test. This translates into a performance improvement of 2.85–5.45×, averaging 4.48× the performance of the ARM processor tested.

Index Terms—Sparse Matrices, FPGA, Multi-Core Architecture

I. INTRODUCTION

Sparse linear algebra computations such as the matrix-vector product or the solution of sparse linear systems are often the bottleneck of many scientific fields, from computational fluid dynamics to structural engineering, electromagnetic analysis or even the study of economic models, to name a few. The task of computation regularly falls to CPUs and, due to the evolution in this field, performance is improving. However, obtaining peak performance from modern cache-based computational systems has proven to be extremely difficult. Several factors contribute to this low performance, such as the underlying machine architecture, memory access behavior, compiler technology and the nature of the input matrix, which may only be known at runtime. In sparse matrices, the fraction of nonzero elements is small when compared to the total number of elements. While it is possible to use generic data structures and routines to perform computations with such matrices, this is inefficient, as most calculations on zero elements are redundant, and sometimes even impractical, due to the large dimensions of these matrices. In practice, sparse matrices are stored using specialized data structures that only store the nonzero values, along with additional information regarding the position of each one in the matrix, so that the total size is proportional to the number of nonzero elements.

As an example of applicability in most of the fields named above, solving a partial differential equation using the finite elements method boils down to solving a system of linear equations of the form y = Ax, where y and x are vectors and A is a large matrix that is mostly composed of zero entries. The nonzero elements of A are arranged in a regular or irregular pattern depending on the selection of a structured or unstructured mesh for the discretization of the original problem [1, 2]. The efficient implementation of these operations is extremely important; however, it is also challenging, as simple implementations of the kernels typically give a performance that is only a fraction of peak [3, 4]. The center of this performance problem is that sparse operations are more bandwidth-bound than dense ones [5, 6]. Consequently, optimizations are needed, but these depend strongly on architectural variations, even between closely related versions of the same processor. To this end, a number of optimization techniques have been proposed, such as register and cache blocking [7, 8], and column or row reordering [9].

The main contribution of this thesis is a novel approach to the efficient computation of the Sparse Matrix-Vector Multiply (SpMV) problem. To this end, a reconfigurable multi-processor architecture has been researched and developed, capable of sustaining high peak performance by avoiding the common pitfalls that affect SpMV computation and therefore achieving better computational efficiency. This was only made possible by a thorough analysis of the storing formats and by efficiently exploiting the advantages provided by programmable logic, such as reduced costs for reading and writing to local memory, pipelining of the datapaths, etc. To complement the developed architecture and in order to further reduce the bandwidth required to perform SpMV, a novel storing format named RCSC is also proposed. Storing the sparse matrix according to this format further reduces the bandwidth required to perform SpMV without increasing the storage space needed, keeping in line with formats known to reduce size to a minimum. The use of this format removes the irregular memory accesses that limit SpMV performance, allowing for a streamlined approach to data transfer. The removal of irregular accesses is the most important performance improvement factor for the SpMV problem, where the possibility of data reuse is very limited.

The architecture is capable of receiving data via DMA [10] streaming and of partitioning the computation equally amongst the implemented PEs. These are capable of producing one result per clock cycle due to their pipelined FMA [11].
Although data retrieval from external memory and computation in the PEs are fully overlapped, the process of writing the output vector to external memory cannot be overlapped, due to the unknown sparsity pattern of the matrix.

The architecture was implemented on an evaluation and development board based on the Xilinx Zynq-7020 All Programmable (AP) System-on-Chip (SoC). The Processing System (PS) of the Zynq-7020 device can provide a bandwidth of 1600 MB/s, which allows for an architecture composed of 4 PEs to be implemented. Results-wise, an average of 624 MFLOPS and 78.5% efficiency is attained for a dataset of 14 sparse matrices and input vectors. When compared to the performance of the ARM Cortex-A9, present in the PS side of the Zynq, execution times are improved by 2.85–5.45×, while the efficiency of the ARM GPP is on average 21.8%, equivalent to 142 MFLOPS.

II. SPARSE MATRIX ALGORITHMS

A. Storing Formats

Enormous effort has been devoted to devising data storing formats with the aim of maximizing performance. This means that to fully optimize SpMV computation, we need to choose a compression algorithm that takes the structure of the sparse matrices into account. The focus of this section is to briefly describe the most common compression/storage formats available. The purpose of all the different format variations is either to improve the architecture's ability to access data and perform computations or to reduce the total space required to store the matrix. The most common format, regardless of the processor used, is CSR, which stores the nonzeros row-wise. Another favorite for implementations in programmable logic fabric is CSC, which stores the nonzeros column-wise. The use of other storing formats depends on the processor used, as ELLPACK and S-ELLPACK are most useful in GPU implementations due to their SIMD capabilities.

        [ 11   0   0  12 ]
    A = [ 61   0   0   0 ]
        [  0  72   0   0 ]
        [  0   0  33   0 ]

Fig. 1. Sparse matrix A

1) CSR: The Compressed Sparse Row (CSR) format stores an initial M × N sparse matrix A in row form using three one-dimensional arrays. Let nz denote the number of nonzero elements of A. The first array is called VAL and is of length nz. This array holds the values of all the nonzero entries of A in left-to-right then top-to-bottom (row-major) order. The second array, COL_IND, contains the column index (zero-based) of each element of VAL. The third and final array, ROW_PTR, is of length M + 1 (i.e. one entry per row, plus one). ROW_PTR(i) contains the index in VAL of the first nonzero element of row i. Row i of the original matrix A thus extends from VAL(ROW_PTR(i)) to VAL(ROW_PTR(i + 1) − 1), i.e. from the start of one row to the last index before the start of the next. The last entry in ROW_PTR (at zero-based index M) must be the number of elements in VAL (i.e. the number of nonzero elements of A). The name of this format reflects the fact that the row index information is compressed, which delivers increased efficiency in arithmetic operations and row slicing partitioning algorithms. A depiction of sparse matrix A of figure 1 compressed in CSR format is shown in figure 2.

VAL     = [11 12 61 72 33]
COL_IND = [0 3 0 1 2]
ROW_PTR = [0 2 3 4 5]

Fig. 2. Sparse matrix A in CSR format
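To make the CSR traversal concrete, the following C sketch computes y = Ax for a matrix stored in the three arrays described above. The array and variable names follow figure 2; this is an illustrative software kernel only, not the hardware datapath presented later.

#include <stdio.h>

/* y = A*x for an M-row sparse matrix A stored in CSR.
 * val, col_ind and row_ptr correspond to VAL, COL_IND and ROW_PTR of figure 2. */
static void spmv_csr(int M, const float *val, const int *col_ind,
                     const int *row_ptr, const float *x, float *y)
{
    for (int i = 0; i < M; i++) {
        float sum = 0.0f;
        /* Row i occupies positions row_ptr[i] .. row_ptr[i+1]-1 of val/col_ind. */
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_ind[j]];   /* irregular (gather) access to x */
        y[i] = sum;
    }
}

int main(void)
{
    /* Matrix A of figure 1 (M = 4, N = 4, nz = 5) in the arrays of figure 2. */
    float val[]     = {11, 12, 61, 72, 33};
    int   col_ind[] = {0, 3, 0, 1, 2};
    int   row_ptr[] = {0, 2, 3, 4, 5};
    float x[4] = {1, 2, 3, 4}, y[4];

    spmv_csr(4, val, col_ind, row_ptr, x, y);
    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}

Note how every nonzero forces a dependent read of x[col_ind[j]], the isolated accesses discussed in the comparison below.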
2) CSC: The Compressed Sparse Column (CSC) format is analogous to CSR, except that it follows a column-major order for the nonzero elements in the arrays. Instead of a column index, a row index is stored for each value, as are column pointers. This means the CSC format is composed of the vectors VAL, ROW_IND and COL_PTR, where VAL is an array of the (top-to-bottom then left-to-right) nonzero values of the sparse matrix; ROW_IND holds the row indexes corresponding to the values in VAL; and COL_PTR contains the index in VAL of the first nonzero element of column j. Due to its efficiency in arithmetic operations, column slicing, and matrix-vector products, this is the traditional format for specifying a sparse matrix in MATrix LABoratory (MATLAB). A depiction of sparse matrix A of figure 1 compressed in CSC format is shown in figure 3.

VAL     = [11 61 72 33 12]
ROW_IND = [0 1 2 3 0]
COL_PTR = [0 2 3 4 5]

Fig. 3. Sparse matrix A in CSC format

When comparing both of the previous formats, and assuming that all vectors related to the compression format and vector x are stored in memory, CSC requires fewer accesses to memory than CSR, since the values of vector x are reused in the computation of each nonzero element belonging to the same column. For the CSR format, more values have to be transferred from memory for each computation. First, two values of vector ROW_PTR need to be transferred; by computing ROW_PTR[i + 1] − ROW_PTR[i], the number of nonzero elements in the row is obtained. Then, for each (value, index) pair of nonzero elements transferred, the respective x[index] needs to be retrieved. This not only translates into a different value of vector x for each nonzero in every row, but also into delays in memory transfers and isolated accesses. The delay is caused by the necessary sequencing of the transfers of nonzero indexes and values, while the isolated transfers are caused by the values of x not being necessarily adjacent, so that burst reading or caching of values is not very useful.

These problems do not occur when dealing with a sparse matrix stored in CSC format. For every nonzero value and index transferred, the values of x needed are reused for each column analyzed. This leads to improvements when using cache and burst readings from memory, an important factor in SpMV.
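The same product written over the CSC arrays of figure 3 makes this access pattern explicit: x[j] is fetched once and reused for every nonzero of column j, while the accumulation scatters into y, which is precisely the read-after-write dependence the PEs must treat as a hazard. A minimal C sketch (again only illustrative, with names taken from figure 3):

/* y = A*x for an N-column sparse matrix A stored in CSC.
 * val, row_ind and col_ptr correspond to VAL, ROW_IND and COL_PTR of figure 3.
 * The caller must zero-initialise y. */
static void spmv_csc(int N, const float *val, const int *row_ind,
                     const int *col_ptr, const float *x, float *y)
{
    for (int j = 0; j < N; j++) {
        float xj = x[j];                  /* fetched once, reused for the whole column */
        for (int k = col_ptr[j]; k < col_ptr[j + 1]; k++)
            y[row_ind[k]] += val[k] * xj; /* scattered accumulation into y */
    }
}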
performed in the ARM processor. To verify the ability of the accelerator to process any sparse matrix, the chosen dataset (table II) included matrices whose parameters vary greatly: from square (N × N, e.g. bcsstm27) to rectangular matrices (M × N, e.g. Flower_5_4), and with diverse sparsity values, ranging from as low as 0.02% (e.g. OPF_6000) up to 23.08% (e.g. bibd_14_7). Within the matrices the number of hazards also varies greatly, from zero (e.g. bibd_14_7) to 47.46% (e.g. N_pid). Each of these parameters influences computational time, although only the accelerators are affected by RAW hazards, as the ARM processor follows a distinct execution path. Attention was also given to the existing patterns in the matrices and to the field of science each matrix belongs to, so as to include as many types as possible in the dataset. The accelerator achieves an average performance of 624 MFLOPS, for a computational efficiency of 78.03%. This can be seen in figure 10, where the performance of both the ARM processor and the architecture is represented in terms of achieved MFLOPS per matrix in the dataset. The respective peak performances are represented as well, clearly showing that the ARM processor achieves a performance level well below peak.
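As an aside, reproducing the hazard percentages above requires a precise definition of a hazard; the text treats them as RAW conflicts on the y accumulation inside the FMA pipeline. The C sketch below is therefore only a hypothetical illustration: it assumes (my assumption, not stated in this summary) that a hazard is counted whenever a nonzero targets a y row updated by one of the previous FMA_STAGES nonzeros of the stream, with the nonzeros taken in CSC order.

/* Hypothetical hazard counter for a CSC-ordered stream of nonzeros.
 * row_ind holds the row index of each of the nz nonzeros. */
#define FMA_STAGES 8   /* FMA pipeline depth stated in the text */

static double hazard_percentage(const int *row_ind, int nz)
{
    int hazards = 0;
    for (int i = 1; i < nz; i++) {
        int lookback = (i < FMA_STAGES) ? i : FMA_STAGES;
        for (int d = 1; d <= lookback; d++) {
            if (row_ind[i] == row_ind[i - d]) { hazards++; break; }
        }
    }
    return 100.0 * hazards / nz;
}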
B. Analytical Model

An analytical model was developed in order to predict the performance of the architecture with larger sparse matrices and when using more SpMV Accelerators. Given that the transfer of data from the external memory and the computation in the SpMV Accelerator are performed concurrently, execution time depends mostly on the communication performance. To develop the analytical model, each component that constitutes the implemented accelerators was modeled by determining the number of clock cycles required to process a nonzero element.

For the Packager unit, as one clock cycle is required per nonzero and structural element of the matrix, the total number of cycles required to process the entire sparse matrix is given by nz + n. As one instruction is sent to the PEs per nonzero element, the number of clock cycles required per nonzero element is given by equation 1:

    Packager [cycles/nz] = (nz + n) / nz                                            (1)

The number of cycles required by the PEs to perform the computation depends on the number and distribution of hazards within the matrix. As no pattern can be assumed for the nonzero elements within a sparse matrix, a simple Bernoulli distribution is assumed for the occurrence of hazards. As such, the number of cycles required by each PE to process the nonzero elements arriving via the Packager is given by equation 2:

    PE [cycles/nz] = Worst_stall × p_hazard + No_stall × (1 − p_hazard)             (2)

where p_hazard is the probability of a hazard in the sparse matrix (equal to zero when no hazards occur and equal to one when a hazard exists for every nonzero element), Worst_stall represents the highest number of clock cycles the PE needs to be stalled in order for the hazard to disperse, and No_stall gives the minimum number of clock cycles required for the nonzero element to enter the FMA unit when no hazard is detected. The value of Worst_stall is given by the number of pipeline stages of the FMA, which in this architecture is always equal to 8. No_stall is 2, as one clock cycle is required to read an address from the input FIFO and another for the corresponding y element to be at the output of the local memory.

As one SpMV Accelerator is composed of one Packager and two PEs, and given that the operations in the Packager and in the PEs are overlapped, the total number of clock cycles required to process a nonzero element is given by the component with the lowest throughput, the Packager or the two PEs, as shown by equation 3:

    Accelerator [cycles/nz] = max( Packager [cycles/nz] ; PE [cycles/nz] / 2 )      (3)

The estimate for the total execution time must also consider the transfer of all elements of vector y to external memory. For this, the internal DMA FIFOs, with a fixed length of 512 elements, need to be filled with the maximum burst length (256 elements) before the transfer is started. After the first burst of 256 elements, the process of filling the FIFO and transmitting the elements of vector y is performed simultaneously. With this latency accounted for, the execution time for an implementation of k accelerators can be estimated by equation 4, where m is the number of elements of vector y, f represents the frequency at which the SpMV Accelerator works in MHz, and the reduction circuit now contributes a larger portion of the execution time, as it adds a term that depends on the number of pipeline stages of the adder used (Adder_stages) times the number of reduction levels (⌈log2(k)⌉), as previously depicted in figure 8:

    Theo_exec_time [us] = ( (nz / k) × max( Packager [cycles/nz] ; PE [cycles/nz] / 2 )
                            + m + Adder_stages × ⌈log2(k)⌉ + 256 ) × (1 / f)        (4)

To validate the developed model, estimated execution times were calculated for several sparse matrices using equation 4 and compared to the values measured from execution in the target device. Table III shows both the experimental and the estimated execution times (in microseconds), as well as the error between measurement and estimation.
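As a cross-check, equations 1 to 4 can be transcribed directly into a small C routine. The sketch below mirrors the equations, with Worst_stall = 8 and No_stall = 2 as stated above; nz, n (structural elements), m (length of y), the hazard probability, the number of accelerators k, the adder pipeline depth and the clock frequency in MHz are supplied by the caller.

#include <math.h>

#define WORST_STALL 8.0   /* FMA pipeline stages */
#define NO_STALL    2.0   /* FIFO read + local-memory read */

/* Equations 1-4: estimated execution time in microseconds. */
static double theo_exec_time_us(double nz, double n, double m, double p_hazard,
                                int k, int adder_stages, double f_mhz)
{
    double packager = (nz + n) / nz;                                     /* eq. 1 */
    double pe = WORST_STALL * p_hazard + NO_STALL * (1.0 - p_hazard);    /* eq. 2 */
    double accel = fmax(packager, pe / 2.0);                             /* eq. 3 */
    double cycles = (nz / k) * accel
                    + m + adder_stages * ceil(log2((double)k)) + 256.0;  /* eq. 4 */
    return cycles / f_mhz;    /* f in MHz, so cycles/f is already in microseconds */
}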
C. Estimation for k accelerators

Assuming a target device with enough available bandwidth to feed a number k > 2 of accelerators, the analytical model can estimate the execution times for any sparse matrix. Execution times were estimated for all matrices that constitute the dataset, although, in order to avoid redundancy, matrices that produced similar results are not shown in table III. The number of accelerators was varied between three and seven and, for comparison's sake, results from implementations with one and two accelerators are also included. Despite the length of the reduction circuit increasing with the number of implemented accelerators (as given by the reduction term growing with k), the number of elements passing through this block is relatively small (n ≪ nz), so the time spent adding all partial results of vector y from each accelerator in the reduction circuit is a small fraction of the total computational time. As evidenced by table IV, speedup gains increase almost linearly with the number of implemented accelerators.
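Using the theo_exec_time_us sketch from the previous section, the same sweep over k can be reproduced in a few lines. The matrix parameters below are placeholders chosen only to illustrate the call, not values taken from the thesis dataset; the 100 MHz clock is the programmable-logic frequency mentioned later, while the adder depth is a placeholder.

#include <stdio.h>

int main(void)
{
    /* Hypothetical matrix: 100k nonzeros, 5k structural elements, y of length 5k. */
    double nz = 100000, n = 5000, m = 5000, p_hazard = 0.15;

    for (int k = 1; k <= 7; k++)
        printf("k = %d -> %.1f us\n", k,
               theo_exec_time_us(nz, n, m, p_hazard, k,
                                 /* adder_stages (placeholder) */ 8,
                                 /* f_mhz */ 100.0));
    return 0;
}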
TABLE II
EXECUTION TIME REQUIRED TO COMPUTE SPARSE MATRICES IN BOTH THE ARM AND ONE SpMV ACCELERATOR, AND RESPECTIVE OBTAINED SPEEDUP AND ARCHITECTURAL EFFICIENCY

Matrix           Rows    Columns   Sparsity (%)   Hazards (%)   ARM (us)   Accelerator (us)   Speedup   Efficiency (%)
MK9-B3            945      1260        0.32          14.23          139             38          3.66        75.5
MK10-B4           945      4725        0.11          25.48          268             61          4.42        56.3
Maragal_2         555       350        2.24           2.50          128             32          3.95        93.7
Flower_5_4       5226     14721        0.06           3.17         1644            350          4.70        77.1
bibd_14_7          91      3432       23.08           0.00         2055            405          4.98        95.5
LP_Pilot87       2030      6680        0.55           8.00         2222            489          4.55        81.1
bcsstm27         1224      1224        1.91           0.13          803            163          4.92        96.1
qc2534           2534      2534        3.63           0.17         6274           1215          5.16        98.2
big             13209     13209        0.05          28.26         3122            967          3.23        57.5
adder_dcop_64    1813      1813        0.33          21.56          365            106          3.45        64.8
p2p_gnutella08   6301      6301        0.05          20.45          532            233          2.29        68.2
N_pid            3625      3923        0.06          47.17          341            132          2.59        51.2
OPF_6000        29902     29902        0.02          44.19         5104           1885          2.71        47.7
SiNa             5743      5743        0.31           1.63         3021            599          5.05        95.0
Fig. 10. ARM and Hardware performance measurement in MFLOPS for each matrix in the dataset
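For reference, the MFLOPS, speedup and efficiency figures reported in table II and figure 10 can be derived from the measured execution times as sketched below, assuming the usual SpMV accounting of two floating-point operations (one multiply and one add) per nonzero element; the peak MFLOPS value used for the efficiency column is not stated explicitly here, so it is treated as a caller-supplied parameter.

/* time_us is the measured execution time in microseconds. */
static double mflops(double nz, double time_us)        { return 2.0 * nz / time_us; }
static double speedup(double arm_us, double acc_us)    { return arm_us / acc_us; }
static double efficiency_pct(double achieved_mflops, double peak_mflops)
{
    return 100.0 * achieved_mflops / peak_mflops;
}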
TABLE III
EXPERIMENTAL RESULTS VS. MODEL PREDICTIONS

Matrix            HW exec (us)   Theo exec (us)   Error (%)
MK9-B3                  38              39           2.82
MK10-B4                 61              59           2.11
Maragal_2               32              32           2.12
Flower_5_4             350             348           0.47
bibd_14_7              405             381           6.02
LP_Pilot87             489             488           0.25
bcsstm27               163             164           0.63
qc2534                1215            1205           0.80
big                    967             980           1.36
adder_dcop_64          106             110           3.69
p2p_gnutella08         233             233           0.19
N_pid                  132             136           3.45
OPF_6000              1885            1990           5.57
SiNa                   599             600           0.25

V. COMPARISON TO EXISTING SpMV ARCHITECTURES

In table V, the performance of the implemented architecture is compared to that of previous architectures. Naturally, the sparsity of the tested matrices influences the efficiency of the developed accelerator, as it does in the remaining works. Sparse matrix parameters that adversely affect the performance of the implemented accelerator include the number of rows, as it increases the size of the output vector that must be written to external memory; the ratio of matrix columns to nonzero elements, as it influences the performance of the Packager unit; and the number of hazards within the matrix, as each one requires the PE to stall the FMA core until the hazard is resolved.

All results were compared to a software-only execution in the ARM, optimized to take full advantage of the existing vector floating-point unit (128-bit SIMD) and operating at a frequency of 650 MHz, more than six times the frequency at which the programmable logic operates (100 MHz).
TABLE IV
HARDWARE AND SOFTWARE EXECUTION TIMES, IN MICROSECONDS (us), AND RESPECTIVE SPEEDUP OBTAINED FOR A SYSTEM WITH k SpMV ACCELERATORS USING COLUMN BLOCKS
TABLE V
COMPARISON OF FPGA SpMV ARCHITECTURES
VI. CONCLUSION

The results show that the architecture proposed in this work is able to achieve an average of 624 MFLOPS (single precision floating-point). This corresponds to a performance efficiency of 79%. These figures are better than those obtained by the ARM processor, where 142 MFLOPS were measured, corresponding to 22% of its peak performance. This translates into a performance improvement of 4.39× on average. Given the predictions of the analytical model, it is expected that the architecture is able to scale with the available bandwidth. Speedup values averaging 10.56× over the execution in a general purpose processor are to be expected, as long as enough bandwidth is available, along with the resources to implement the architecture in programmable logic.

REFERENCES

[1] C. Gfrerer (2012). Optimization Methods for Sparse Matrix-Vector Multiplication. In Seminar for Computer Science.
[2] Y. Saad (2003). Iterative Methods for Sparse Linear Systems. SIAM.
[3] R. W. Vuduc and H.-J. Moon (2005). Fast sparse matrix-vector multiplication by exploiting variable block structure. In Proceedings of the High-Performance Computing and Communications Conference (HPCC'2005), pages 807–816.
[4] A. Buttari, V. Eijkhout, J. Langou and S. Filippone (2005). Performance Optimization and Modeling of Blocked Sparse Kernels. Technical Report ICL-UT-04-05.
[5] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick and J. Demmel (2007). Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. In Supercomputing.
[6] J. D. Davis and E. S. Chung (2012). SpMV: A Memory-Bound Application on the GPU Stuck Between a Rock and a Hard Place. Microsoft Technical Report.
[7] E.-J. Im and K. Yelick (1999). Optimizing Sparse Matrix Vector Multiplication on SMPs. In Ninth SIAM Conference on Parallel Processing for Scientific Computing (SIAM'1999). SIAM.
[8] E.-J. Im and K. Yelick (1997). Optimizing sparse matrix computations for register reuse in SPARSITY. In Proceedings of the International Conference on Computational Sciences - Part I, pages 127–136.
[9] A. Pinar and M. T. Heath (1999). Improving performance of sparse matrix-vector multiplication. In Supercomputing.
[10] Wilson M. José, Ana Rita Silva, Mário P. Véstias and Horácio C. Neto (2014). Algorithm-oriented design of efficient many-core architectures applied to dense matrix multiplication. In Analog Integrated Circuits and Signal Processing, pages 9–16. IEEE.
[11] Mário P. Véstias (2014). DReaMaCHine - Design of a Reconfigurable Many-Core Architecture for High Perfor-