
FPGA Multi-Processor for Sparse Matrix Applications
João Pinhão, IST

Abstract—This thesis proposes a novel, computationally efficient multi-core architecture for solving the sparse matrix-vector multiply (SpMV) problem on FPGAs. The efficient implementation of SpMV is challenging, as simple implementations of the kernels typically give a performance that is only a fraction of the peak. At the center of this problem is the fact that sparse operations are more bandwidth-bound than dense ones. Previous works on the subject suggest that the use of FPGAs for solving SpMV can improve performance levels when compared to the use of General Purpose Processors (GPPs), thereby improving total execution times when solving problems largely dependent on SpMV. As such, in this work the existing sparse matrix compression/storage formats are analyzed, their feasibility for an efficient implementation is verified, and lastly a multi-processor architecture is proposed, capable of better using the available bandwidth in order to achieve higher levels of performance. Through extensive experimentation on a wide dataset, the architecture exhibits the ability to outperform GPPs both in terms of peak and average performance. Given the target platform (Zynq), the performance of the developed architecture was compared to the ARM Cortex-A9 GPP present on the platform. The architecture has been shown to achieve an average of 78.5% (624.23 MFLOPS) of peak performance, with a peak of 98.2% (785.23 MFLOPS). The ARM processor, on the other hand, attained an average of 141.53 MFLOPS, equivalent to 21.8% of its peak performance, while achieving a peak of 204.11 MFLOPS (31.4%) in one test. This translates into a performance improvement of 2.85−5.45×, averaging 4.48× the performance of the ARM processor tested.

Index Terms—Sparse Matrices, FPGA, Multi-Core Architecture

I. INTRODUCTION

Sparse linear algebra computations, such as the matrix-vector product or the solution of sparse linear systems, are often the bottleneck of many scientific fields, from computational fluid dynamics to structural engineering, electromagnetic analysis or even the study of economic models, to name a few. The task of computation regularly falls to CPUs and, due to the evolution in this field, performance is improving. However, obtaining peak performance from modern cache-based computational systems has proven to be extremely difficult. Several factors contribute to this low performance, such as the underlying machine architecture, memory access behavior, compiler technology and the nature of the input matrix, which may only be known at runtime. In sparse matrices, the fraction of nonzero elements is small when compared to the total number of elements. While it is possible to use generic data structures and routines to perform computations with such matrices, this is inefficient, as most calculations on zero elements are redundant, and it is sometimes even impractical due to the large dimensions of these matrices. In practice, sparse matrices are stored using specialized data structures that only store the nonzero values, along with additional information regarding the position of each value in the matrix, so that the total size is proportional to the number of nonzero elements.

As an example of applicability in most of the fields named above, solving a partial differential equation using the finite element method boils down to solving a system of linear equations of the form y = Ax, where y and x are vectors and A is a large matrix that is mostly composed of zero entries. The nonzero elements of A would be arranged in a regular or irregular pattern, depending on the selection of a structured or unstructured mesh for the discretization of the original problem [1, 2]. The efficient implementation of these operations is extremely important; however, it is also challenging, as simple implementations of the kernels typically give a performance that is only a fraction of peak [3, 4]. At the center of this performance problem is the fact that sparse operations are more bandwidth-bound than dense ones [5, 6]. Consequently, optimizations are needed, but these are very dependent on architectural variations, even between closely related versions of the same processor. To this end, a number of optimization techniques have been proposed, such as register and cache blocking [7, 8], and column or row reordering [9].

The main contribution of this thesis is a novel approach to the efficient computation of the Sparse Matrix-Vector Multiply (SpMV) problem. To this end, a reconfigurable multi-processor architecture has been researched and developed, capable of sustaining high peak performance by avoiding the common pitfalls that affect SpMV computation and therefore achieving better computational efficiency. This was only made possible after a thorough analysis of the storage formats and by efficiently exploiting the advantages provided by programmable logic, such as reduced costs for reading and writing to local memory, pipelining of the datapaths, etc. To complement the developed architecture, and in order to further reduce the bandwidth required to perform SpMV, a novel storage format named RCSC is also proposed. Storing the sparse matrix according to this format further reduces the bandwidth required to perform SpMV without increasing the storage space needed, keeping in line with formats known to reduce size to a minimum. The use of this format removes the irregular memory accesses that limit SpMV performance, allowing for a streamlined approach to data transfer. The removal of irregular accesses is the most important performance improvement factor for the SpMV problem, where the possibility of data reuse is very limited.

The architecture is capable of receiving data via DMA [10] streaming and equally partitions the computation amongst the implemented PEs. These are capable of producing one result per clock cycle due to their pipelined FMA [11].
Although data retrieval from external memory and computation in the PEs are fully overlapped, due to the unknown sparsity pattern of the matrix the process of writing the output vector to external memory cannot be overlapped.

The architecture was implemented on an evaluation and development board based on the Xilinx Zynq-7020 All Programmable (AP) System-on-Chip (SoC). The Processing System (PS) of the Zynq-7020 device can provide a bandwidth of 1600 MB/s, which allows for an architecture composed of 4 PEs to be implemented. Results-wise, an average of 624 MFLOPS and 78.5% efficiency is attained for a dataset of 14 sparse matrices and input vectors. When compared to the performance of the ARM Cortex-A9, present in the PS side of the Zynq, execution times are improved by 2.85−5.45×, while the efficiency of the ARM GPP is on average 21.8%, equivalent to 142 MFLOPS.

II. SPARSE MATRIX ALGORITHMS

A. Storing Formats

Enormous effort has been devoted to devising data storage formats with the aim of maximizing performance. This means that, to fully optimize SpMV computation, we need to choose a compression algorithm that takes the structure of the sparse matrices into account. The focus of this section is to briefly describe the most common compression/storage formats available. The purpose of all the different format variations is either to improve the architecture's ability to access data and perform computations or to reduce the total space required to store the matrix. The most common format, regardless of the processor used, is CSR, which stores the nonzeros row-wise. Another favorite for implementations in programmable logic fabric is CSC, which stores the nonzeros column-wise. The use of other storage formats depends on the processor used, as ELLPACK and S-ELLPACK are most useful in GPU implementations due to their SIMD capabilities.

1) CSR: The Compressed Sparse Row (CSR) format stores an initial M × N sparse matrix A in row form using three one-dimensional arrays. Let nz denote the number of nonzero elements of A. The first array is called VAL and is of length nz. This array holds the values of all the nonzero entries of A in left-to-right then top-to-bottom (row-major) order. The second array, COL_IND, contains the column index (zero-based) of each element of VAL. The third and final array, ROW_PTR, is of length M + 1 (i.e. one entry per row, plus one). ROW_PTR(i) contains the index in VAL of the first nonzero element of row i. Row i of the original matrix A extends from VAL(ROW_PTR(i)) to VAL(ROW_PTR(i + 1) − 1), i.e. from the start of one row to the last index before the start of the next. The last entry in ROW_PTR (at zero-based index M) must be the number of elements in VAL (i.e. the number of nonzero elements of A). The name of this format reflects the fact that the row index information is compressed, which delivers increased efficiency in arithmetic operations and in row-slicing partitioning algorithms. A depiction of the sparse matrix A of figure 1 compressed in CSR format is shown in figure 2.

2) CSC: The Compressed Sparse Column (CSC) format is analogous to CSR, except that it follows a column-major order when arranging the nonzero elements in the arrays. Instead of a column index, a row index is stored for each value, as are column pointers. This means the CSC format is composed of the vectors VAL, ROW_IND and COL_PTR, where VAL is an array of the (top-to-bottom then left-to-right) nonzero values of the sparse matrix; ROW_IND holds the row indexes corresponding to the values in VAL; and COL_PTR contains the index in VAL of the first nonzero element of column j. Due to its efficiency in arithmetic operations, column slicing, and matrix-vector products, this is the traditional format for specifying a sparse matrix in MATrix LABoratory (MATLAB). A depiction of the sparse matrix A of figure 1 compressed in CSC format is shown in figure 3.

        [ 11   0   0  12 ]
    A = [ 61   0   0   0 ]
        [  0  72   0   0 ]
        [  0   0  33   0 ]

Fig. 1. Sparse matrix A

    VAL     = [11 12 61 72 33]
    COL_IND = [0 3 0 1 2]
    ROW_PTR = [0 2 3 4 5]

Fig. 2. Sparse matrix A in CSR format

    VAL     = [11 61 72 33 12]
    ROW_IND = [0 1 2 3 0]
    COL_PTR = [0 2 3 4 5]

Fig. 3. Sparse matrix A in CSC format

When comparing both of the previous formats, and assuming all vectors related to the compression format and the vector x are stored in memory, CSC requires fewer accesses to memory than CSR, since the values of vector x are reused in the computation of each nonzero element belonging to the same column. For the CSR format, more values have to be transferred from memory for each computation. First, two values of the vector ROW_PTR need to be transferred; by computing ROW_PTR[i + 1] − ROW_PTR[i], the number of nonzero elements in the row is obtained. Then, for each (value, index) pair of nonzero elements transferred, the respective x[index] needs to be retrieved. This not only translates into a different value of vector x for each nonzero in every row, but also into delays in memory transfers and isolated accesses. The delay is caused by the necessary sequencing of the transfers of nonzero indexes and values, while the isolated transfers are caused by the values of x not being necessarily adjacent, so that burst reading or caching of values is not very useful. These problems do not occur when dealing with a sparse matrix stored in CSC format: for every nonzero value and index transferred, the values of x needed are reused for each column analyzed. This leads to improvements when using cache and burst readings from memory, an important factor in SpMV.
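To make this access-pattern comparison concrete, the following C sketch (illustrative only, not code from the thesis; it uses the CSR and CSC vectors defined above with zero-based indexing) shows a straightforward software kernel for each format.

```c
#include <stddef.h>

/* y = A*x with A (m rows) stored in CSR form (VAL, COL_IND, ROW_PTR).
 * Every nonzero forces an indirect load x[COL_IND[k]].              */
void spmv_csr(size_t m, const float *VAL, const int *COL_IND,
              const int *ROW_PTR, const float *x, float *y)
{
    for (size_t i = 0; i < m; i++) {
        float acc = 0.0f;
        for (int k = ROW_PTR[i]; k < ROW_PTR[i + 1]; k++)
            acc += VAL[k] * x[COL_IND[k]];    /* irregular access to x */
        y[i] = acc;
    }
}

/* y = A*x with A (m rows, n columns) stored in CSC form
 * (VAL, ROW_IND, COL_PTR).  x[j] is read once and reused for the
 * whole column; the irregular accesses move to the updates of y.   */
void spmv_csc(size_t m, size_t n, const float *VAL, const int *ROW_IND,
              const int *COL_PTR, const float *x, float *y)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0f;
    for (size_t j = 0; j < n; j++) {
        const float xj = x[j];                /* reused for the column */
        for (int k = COL_PTR[j]; k < COL_PTR[j + 1]; k++)
            y[ROW_IND[k]] += VAL[k] * xj;     /* scattered updates to y */
    }
}
```

In the FPGA design described later, the scattered updates to y are absorbed by keeping y in on-chip BRAM, which is exactly why the column-major family of formats is preferred there.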
B. SpMV Algorithms

Most of the previous relevant work in this area has been done on General Purpose Processors (GPPs) and GPUs, where improvements can mostly be made with regard to the data stored in cache or in local GPU memory. An analysis of five FPGA-based architectures indicated that performance improvements to the SpMV problem depend on efficiently exploiting the storage formats while minimizing the communication overheads. By approaching the problem in a row-wise fashion, two of the five analyzed works [12, 13] achieved low computational efficiencies due to limitations in scaling the implemented design. Another work [14], based on the CSR format, achieved a higher efficiency at the cost of locally storing a copy of the vector x in each PE, thereby eliminating indirect references to this vector. The two remaining works [15, 16] approached the problem column-wise by choosing the CSC and SPAR formats, respectively. By reducing the indirect references to the vector x, the use of a column-major storage format provided a higher computational efficiency and improved the scalability of the architecture without increasing local memory requirements.

In order to develop an efficient SpMV architecture, a suitable storage format must be chosen. The choice lies between row-major, column-major and other formats. This last option includes formats that imply variable arguments, i.e. a variable block size, which is only known at runtime; a hardware implementation of an SpMV accelerator based on these formats is therefore likely to perform poorly. As such, by discarding these variable formats, the choice is limited to row-major or column-major formats. The deciding factor is the ability of each format to minimize one of the major problems in efficient SpMV computation: indirect memory references. If a sparse matrix is stored in a row-major format, e.g. CSR, the nonzero column indexes are also stored. These are the cause of the indirect memory references, as the address computation x_BASE_ADDRESS + index is performed for each nonzero element in the matrix, and the index value is not likely to follow any pattern. Another factor to consider is that 3 load operations precede each nonzero computation: one to retrieve the nonzero element, another for the respective index, and one for the value of x given by the nonzero index. In turn, assuming a sparse matrix stored in a column-major format, e.g. CSC, the indexes represent the row of each nonzero element. In an FPGA-based implementation, by storing the output vector y in local memory consisting of BRAMs, these irregular accesses cost one clock cycle, as opposed to several in external memory. As such, it is reasonable to say that a column-major format is the most suitable for our implementation. The proposal herein is to modify the CSC format in which the sparse matrix is stored, such that only 2 vectors are used to represent the matrix-vector data, instead of 4: the vector x and the vector VAL are merged into one vector named VAL_X, and the vectors ROW_IND and COL_PTR are merged into one vector named IND_PTR.

Given the reasons presented above, an important number of improvements, such as the use of FMA units in order to pipeline computations, RAW hazard detection, and methods of workload division across a configurable number of PEs, are proposed, which aim to improve the computation of SpMV. Based on these improvements and the RCSC storage format, the development of a reconfigurable architecture capable of achieving high computational efficiency and reduced execution time is detailed.

III. ARCHITECTURE OVERVIEW

A. Hardware/Software Interface Evaluation

In order to achieve the best possible utilization of the available bandwidth and ensure an optimized SpMV accelerator, the capabilities of the target device, the Zynq-7000, had to be evaluated. The available interfaces for integrating PL-based accelerators with the PS are the AXI ACP, AXI HP and AXI GP. The AXI High Performance (HP) interface is composed of four high-performance/high-bandwidth master ports capable of 32-bit or 64-bit data transfers, with separate read/write command issuing for concurrent operations, mediated by an asynchronous 1KB FIFO. Each of these interfaces can provide at most 1200 MB/s when 64 bits wide, just like the AXI ACP, or 600 MB/s when 32 bits wide. When using all four high performance ports, the total available throughput is 4800 MB/s in 64-bit mode or 2400 MB/s in 32-bit mode. All these throughputs assume a clock frequency of 150 MHz. Hence, using the four available High Performance ports to transfer data between an architecture implemented in the PL and external memory (i.e. DDR) provides four 32-bit words per clock cycle and an average throughput of 2538 MB/s.

B. Workload Balancing

No sparsity pattern is assumed for the input matrices. As such, data transfers and computational parallelism need to be chosen and scheduled accordingly. The most common pattern is the diagonal pattern (figure 4(a)), in which all nonzero values are located on the main matrix diagonal and/or scattered alongside it. Also included in the denomination of diagonal-pattern matrices are matrices with 3 diagonals (figure 4(b)), in which the nonzero values are located across 3 diagonals, one primary and two secondary. For such matrices, assigning row ranges per PE results in an idle system the majority of the time. To solve this, single rows are alternately assigned to each processor in a round-robin scheme. This guarantees a good workload balance amongst the existing processors, as evidenced in figure 5, and thus, given a good hardware implementation of each PE, it is expected that the fewest PEs are idle for the least amount of time.

Fig. 4. Diagonal Sparse Matrices (a) cdde1 and (b) ck656 from [17]
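As an illustration of the round-robin assignment described above (a sketch, not part of the thesis), the following C fragment counts the nonzeros each PE receives when single rows are handed out in turn; this is how an imbalance like the one shown in figure 5 can be quantified from a CSR-style row-pointer array.

```c
#include <stddef.h>

#define NUM_PE 4

/* Count how many nonzeros each PE receives when row i is handed to
 * PE (i % NUM_PE), i.e. single rows assigned round-robin.  ROW_PTR is
 * the CSR row-pointer array of length m + 1.                         */
void round_robin_load(size_t m, const int *ROW_PTR, long nnz_per_pe[NUM_PE])
{
    for (int p = 0; p < NUM_PE; p++)
        nnz_per_pe[p] = 0;
    for (size_t i = 0; i < m; i++)
        nnz_per_pe[i % NUM_PE] += ROW_PTR[i + 1] - ROW_PTR[i];
}
```

For diagonal or banded matrices, consecutive rows carry similar numbers of nonzeros, so interleaving single rows spreads the work almost evenly across the PEs, whereas assigning contiguous row ranges would leave most PEs idle while one range is processed.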
Fig. 5. Nonzero assignment of a sample of 7 sparse matrices to a 4-PE system (percentage of the total nonzero elements of the respective matrix)

C. Architecture

Addressing the problems inherent to SpMV, and in conjunction with an exploration of the target device's capabilities, an architecture that seeks to solve these problems was developed. This involved scheduling data transfers and computational parallelism accordingly. Through a methodical process of testing, an efficient and scalable architecture was developed, capable of processing any sparse matrix and input vector. The SpMV Accelerator (figure 6) consists of a Packager unit, which processes and distributes the matrix/vector data to a Linear Processing Array composed of p PEs connected linearly. Data is received at the Packager through two DMA units connected to two HP ports. Each Processing Element is composed of an input FIFO meant to store the instructions sent by the Packager to be processed, a delay block, a Fused Multiply-Add (FMA) [11] operator and local memory. Figure 7 shows a block diagram of each PE.

Fig. 6. RCSC Architecture Block Diagram

Fig. 7. Processing Element

The FMA performs single-precision floating-point operations, only rounding and normalizing the result at the final stage, in order to maximize the precision of the multiply-add sequence. The FMA is implemented as an 8-stage pipeline and can produce one result per clock cycle. When processing a row of the sparse matrix in a PE, a Read After Write (RAW) hazard may occur when the value entering is read before it is updated in local memory. To solve this, the delay block, whose purpose is to delay the address of the local memory location to be updated, is altered to indicate whether a computation entering the FMA creates a RAW hazard. When this occurs, stalls are inserted until the address is updated with the correct value.

Testing of the architecture has shown that a limitation exists in the number of PEs that can be fed with data by the Packager. Given the need to verify a RAW hazard before each computation, the PEs require two clock cycles per computation entering the FMA. Given that the Packager issues one computation per clock cycle to the Linear Processing Array, more than two PEs resulted in idle processors.

The novel format (RCSC) proposed in this work requires a lower bandwidth per accelerator and therefore allows the number of parallel accelerators to be increased for the same fixed bandwidth. Given the HP interface capability of providing 128 bits per clock cycle and the developed SpMV accelerator requiring 64 bits, two accelerators can be implemented in the target device. Therefore, the best partitioning scheme between accelerators was also studied. We concluded that partitioning the sparse matrix in row blocks and assigning each block to an accelerator leads to an increase in the number of RAW hazards in the blocks. Another limitation was the need to pre-process the matrix to construct these blocks, resulting in both more execution time and more external memory usage. On the other hand, partitioning the matrix in column blocks requires almost no pre-processing, as RCSC is already a column-major format. The processing of each block results in a partial vector per accelerator, with the final output vector y being the accumulation of all partial vectors. The best process for accumulating these vectors was also studied: both options for the reduction circuit, an Adder Tree and a Queues Cascade, as pictured in figure 8, were evaluated. Given the execution times of both options, the Adder Tree was chosen as the most suitable to sum all partial vectors.
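As an illustration of this column-block reduction, the following sketch (a software model only, not the RTL of the reduction circuit; it assumes the k partial vectors are already available in memory) performs the pairwise, adder-tree-style sum.

```c
#include <stddef.h>

/* Adder-tree style accumulation of k partial output vectors of length m
 * (one per accelerator).  Each level sums pairs of vectors, so after
 * ceil(log2(k)) levels the final vector y is left in partial[0].       */
void reduce_partial_vectors(size_t k, size_t m, float **partial)
{
    for (size_t stride = 1; stride < k; stride *= 2)         /* one tree level      */
        for (size_t a = 0; a + stride < k; a += 2 * stride)  /* pair (a, a+stride)  */
            for (size_t i = 0; i < m; i++)
                partial[a][i] += partial[a + stride][i];
}
```

The ⌈log2(k)⌉ levels of this tree are the same count that later appears in the analytical model as the Adder_stages × ⌈log2(k)⌉ latency term.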
IV. PERFORMANCE EVALUATION

To verify the functionality, the architecture was implemented in the target device, the Zynq-7020. Figure 9 also includes the NEON MPE and VFPv3 cores of the Application Processing Unit (APU), as well as the timer. The NEON and the Vector Floating-Point (VFP) cores extend the functionality of the ARM processor by providing support for the Advanced SIMD and vector floating-point v3 instruction sets. Among the features of the NEON MPE core are SIMD vector and scalar single-precision floating-point computations. The supported vectors are 128 bits wide due to the registers used, but execution proceeds 64 bits at a time. Since both Advanced SIMD and VFPv3 are implemented, they share a bank of 32 64-bit registers. The inclusion of these cores in the schematic of figure 9 is tied to the optimizations applied to the code run by the ARM processor.

Fig. 9. Implementation of two SpMV Accelerators

To compare the computational times required to obtain both hardware and software results, the ARM processor was made to fully utilize the available resources. This implies using high compiler optimization levels (flags -O3 or -Otime), using NEON and VFP SIMD instructions (flag -mfpu=neon-vfpv3), taking advantage of the L1 and L2 caches, loop unrolling, function inlining, and overall improvements to the algorithm that result in a reduced execution time. When comparing hardware and software results, several results are slightly different due to the use of a Fused Multiply-Add unit in the PEs. This is to be expected, as the FMA unit performs a single rounding at the end of the fused multiplication-addition, while the NEON unit truncates and then rounds the result. This yields a more accurate result from the accelerator when compared to the ARM processor, as specified in the IEEE 754 standard for Floating-Point Arithmetic.

Fig. 8. Methods of reducing vector y in Column Blocks for a system composed of k SpMV Accelerators: (a) Adder Tree; (b) Queues Cascade

A. Experimental Results

In terms of resource utilization and working frequency, two cases are presented in table I: the resource utilization considering only the SpMV Accelerators (composed of two Packagers and four PEs), and the total resource utilization considering the whole system, i.e. the SpMV Accelerators, four DMA cores and the AXI Interconnects for the AXI4-FULL and AXI4-LITE interfaces, which are connected to the AXI HP and AXI GP ports respectively. For each resource, the percentage of the total available in the device is also presented.

TABLE I
RESOURCE UTILIZATION OF TWO SPMV ACCELERATORS PLUS REDUCTION CIRCUIT, AND OF THE SYSTEM OVERALL

Resource    Total     Accelerators + Reduction    System
Registers   106400    3876 (3.64%)                9473 (8.90%)
LUTs        53200     5480 (10.30%)               10702 (20.12%)
BRAMs       140       8 (5.71%)                   36 (25.71%)
DSP48       220       8 (3.64%)                   8 (3.64%)

The ability to scale the architecture is limited by the available memory bandwidth: each additional accelerator requires an additional 64 bits per clock cycle. The number of clock cycles required to perform the sparse matrix-vector multiply on both the ARM processor (with all possible improvements) and the architecture implemented in programmable logic was measured. This measurement was performed by setting the timer to zero right before the start of the computation and retrieving its value afterwards. As such, the time required to compute a matrix was not measured directly; rather, the number of timer cycles elapsed during the computation was recorded, and dividing this value by the frequency of the timer (355 MHz) yields the execution time. The time required to compute several sparse matrices from the University of Florida Sparse Matrix Collection [17] was measured using this process, resulting in the values presented in table II. Several sparse matrices were tested.
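A minimal sketch of this tick-to-time conversion is shown below. The timer access itself is platform specific, so read_timer_ticks() is a hypothetical placeholder rather than the actual Zynq driver call, and the flop count simply assumes two floating-point operations (one multiply, one add) per nonzero.

```c
#include <stdint.h>

#define TIMER_FREQ_MHZ 355.0          /* timer frequency given in the text */

/* Hypothetical platform hook (placeholder only): returns the current
 * value of the free-running timer used for the measurements.          */
extern uint64_t read_timer_ticks(void);

/* Convert a measured tick interval into execution time (us) and into
 * MFLOPS for a matrix with nz nonzeros (2 flops per nonzero).         */
void report_spmv_run(uint64_t start, uint64_t stop, uint64_t nz,
                     double *exec_us, double *mflops)
{
    *exec_us = (double)(stop - start) / TIMER_FREQ_MHZ;   /* ticks / (ticks per us) */
    *mflops  = (2.0 * (double)nz) / *exec_us;             /* flops per us == MFLOPS */
}
```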
The purpose of each test was to verify whether the computation performed using the developed accelerator resulted in a lower execution time than the same computation performed on the ARM processor. To verify the reliability of the accelerator in processing any sparse matrix, the chosen dataset (table II) included matrices whose parameters vary greatly: from square (N × N, e.g. bcsstm27) to rectangular matrices (M × N, e.g. Flower 5 4), with sparsity values ranging from as low as 0.02% (e.g. OPF 6000) to 23.08% (e.g. bibd 14 7). Across the matrices the number of hazards also varies greatly, from zero (e.g. bibd 14 7) to 47.46% (e.g. N pid). Each of these parameters influences the computational time, although only the accelerators are affected by RAW hazards, as the ARM processor follows a distinct execution path. Attention was also given to the patterns present in the matrices and to the field of science each matrix belongs to, so as to include as many types as possible in the dataset. The accelerator achieves an average performance of 624 MFLOPS, for a computational efficiency of 78.03%. This can be seen in figure 10, where the performance of both the ARM processor and the architecture is represented in terms of achieved MFLOPS per matrix in the dataset. The respective peak performances are represented as well, clearly showing that the ARM processor achieves a performance level well below peak.

B. Analytical Model

An analytical model was developed in order to predict the performance of the architecture with larger sparse matrices and when using more SpMV Accelerators. Given that the transfer of data from external memory and the computation in the SpMV Accelerator are performed concurrently, the execution time depends mostly on the communication performance. To develop the analytical model, each component of the implemented accelerators was modeled by determining the number of clock cycles required to process a nonzero element.

For the Packager unit, as one clock cycle is required per nonzero and per structural element of the matrix, the total number of cycles required to process the entire sparse matrix is given by nz + n. As one instruction is sent to the PEs per nonzero element, the number of clock cycles required per nonzero element is given by equation 1:

    Packager [cycles/nz] = (nz + n) / nz                                            (1)

The number of cycles required by the PEs to perform the computation depends on the number and distribution of hazards within the matrix. As no pattern can be assumed for the nonzero elements within a sparse matrix, a simple Bernoulli distribution is assumed for the occurrence of hazards. As such, the number of cycles required by each PE to process the nonzero elements arriving via the Packager is given by equation 2:

    PE [cycles/nz] = Worst_stall × p_hazard + No_stall × (1 − p_hazard)             (2)

where p_hazard is the probability of a hazard in the sparse matrix (equal to zero when no hazards occur and equal to one when a hazard exists for every nonzero element), Worst_stall represents the highest number of clock cycles the PE needs to be stalled in order for the hazard to disperse, and No_stall gives the minimum number of clock cycles required for the nonzero element to enter the FMA unit when no hazard is detected. The value of Worst_stall is given by the number of pipeline stages of the FMA; in this architecture this value is always equal to 8. No_stall is 2, as one clock cycle is required to read an address from the input FIFO and another for the corresponding y element to be at the output of the local memory.

As one SpMV Accelerator is composed of one Packager and two PEs, and given that the operations in the Packager and in the PEs are overlapped, the total number of clock cycles required to process a nonzero element is given by the component with the lowest throughput, the Packager or the two PEs, as shown by equation 3:

    Accelerator [cycles/nz] = max( Packager [cycles/nz] ; PE [cycles/nz] / 2 )      (3)

The estimate of the total execution time must also consider the transfer of all elements of vector y to external memory. For this, the internal DMA FIFOs, with a fixed length of 512 elements, need to be filled up to the maximum burst length (256 elements) before the transfer is started. After the first burst of 256 elements, the process of filling the FIFO and transmitting the elements of vector y proceeds simultaneously. With this latency accounted for, the execution time for an implementation of k accelerators can be estimated by equation 4, where m is the number of elements of vector y, f represents the frequency at which the SpMV Accelerator works in MHz, and the reduction circuit now contributes a larger portion of the execution time, as it adds a term dependent on the number of pipeline stages of the adder used (Adder_stages) times the number of reduction levels (⌈log2(k)⌉), as previously depicted in figure 8:

    Theo_exec_time [us] = ( nz/k × max( Packager [cycles/nz] ; PE [cycles/nz] / 2 )
                            + m + Adder_stages × ⌈log2(k)⌉ + 256 ) × 1/f            (4)

To validate the developed model, estimated execution times were calculated for several sparse matrices using equation 4 and compared to the values measured from execution on the target device. Table III shows both the experimental and the estimated execution times (in microseconds), as well as the error between measurement and estimation.

C. Estimation for k accelerators

Assuming a target device with enough available bandwidth to feed a number k > 2 of accelerators, the analytical model can estimate the execution times for any sparse matrix.
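The model translates directly into a few lines of C. The sketch below is illustrative only; it hard-codes the constants stated in the text (an 8-stage FMA as Worst_stall, No_stall = 2, two PEs per Packager) and evaluates equations 1 to 4 for a matrix with nz nonzeros, n structural elements, m output elements, hazard probability p_hazard, k accelerators and a clock of f MHz.

```c
#include <math.h>

#define FMA_STAGES  8    /* Worst_stall: FMA pipeline depth (from the text) */
#define NO_STALL    2    /* cycles per nonzero entering the FMA, no hazard  */
#define PES_PER_ACC 2    /* one Packager feeds two PEs per accelerator      */

/* Equation 1: Packager cycles per nonzero for nz nonzeros plus n
 * structural elements of the compressed matrix.                      */
static double packager_cpnz(double nz, double n)
{
    return (nz + n) / nz;
}

/* Equation 2: PE cycles per nonzero under the Bernoulli hazard model. */
static double pe_cpnz(double p_hazard)
{
    return FMA_STAGES * p_hazard + NO_STALL * (1.0 - p_hazard);
}

/* Equations 3 and 4: estimated execution time in microseconds for k
 * accelerators at f MHz, m elements in the output vector y, and
 * adder_stages pipeline stages in the reduction adder.                */
double theo_exec_time_us(double nz, double n, double m, int k,
                         double p_hazard, int adder_stages, double f_mhz)
{
    /* Equation 3: throughput limited by the Packager or the two PEs.  */
    double acc_cpnz = fmax(packager_cpnz(nz, n), pe_cpnz(p_hazard) / PES_PER_ACC);
    double cycles   = nz / k * acc_cpnz
                      + m + adder_stages * ceil(log2((double)k)) + 256.0;
    return cycles / f_mhz;        /* f MHz == f cycles per microsecond  */
}
```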
TABLE II
EXECUTION TIME REQUIRED TO COMPUTE SPARSE MATRICES ON BOTH THE ARM AND ONE SPMV ACCELERATOR, AND RESPECTIVE OBTAINED SPEEDUP AND ARCHITECTURAL EFFICIENCY

Matrix Rows Columns Sparsity (%) Hazards (%) ARM (us) Accelerator (us) Speedup Efficiency (%)
MK9-B3 945 1260 0.32 14.23 139 38 3.66 75.5
MK10-B4 945 4725 0.11 25.48 268 61 4.42 56.3
Maragal 2 555 350 2.24 2.50 128 32 3.95 93.7
Flower 5 4 5226 14721 0.06 3.17 1644 350 4.70 77.1
bibd 14 7 91 3432 23.08 0.00 2055 405 4.98 95.5
LP Pilot87 2030 6680 0.55 8.00 2222 489 4.55 81.1
bcsstm27 1224 1224 1.91 0.13 803 163 4.92 96.1
qc2534 2534 2534 3.63 0.17 6274 1215 5.16 98.2
big 13209 13209 0.05 28.26 3122 967 3.23 57.5
adder dcop 64 1813 1813 0.33 21.56 365 106 3.45 64.8
p2p gnutella08 6301 6301 0.05 20.45 532 233 2.29 68.2
N pid 3625 3923 0.06 47.17 341 132 2.59 51.2
OPF 6000 29902 29902 0.02 44.19 5104 1885 2.71 47.7
SiNa 5743 5743 0.31 1.63 3021 599 5.05 95.0

Fig. 10. ARM and Hardware performance measurement in MFLOPS for each matrix in the dataset

TABLE III
EXPERIMENTAL RESULTS VS. MODEL PREDICTIONS

Matrix           HW exec (us)   Theo exec (us)   Error (%)
MK9-B3           38             39               2.82
MK10-B4          61             59               2.11
Maragal 2        32             32               2.12
Flower 5 4       350            348              0.47
bibd 14 7        405            381              6.02
LP Pilot87       489            488              0.25
bcsstm27         163            164              0.63
qc2534           1215           1205             0.80
big              967            980              1.36
adder dcop 64    106            110              3.69
p2p gnutella08   233            233              0.19
N pid            132            136              3.45
OPF 6000         1885           1990             5.57
SiNa             599            600              0.25

Execution times were estimated for all matrices that constitute the dataset, although, in order to avoid redundancy, matrices that produced similar results are not shown in table IV. The number of accelerators was varied between three and seven and, for comparison's sake, results from implementations with one and two accelerators are also included. Despite the length of the reduction circuit increasing with the number of implemented accelerators (as given by the reduction time increasing with k), as the number of elements passing through this block is relatively small (n ≪ nz), the time spent adding all the partial results of vector y from each accelerator in the reduction circuit is a small fraction of the total computational time. As evidenced by table IV, the speedup gains increase almost linearly with the number of implemented accelerators.

V. COMPARISON TO EXISTING SPMV ARCHITECTURES

In table V, the performance of the implemented architecture is compared to previous architectures. Naturally, the sparsity of the tested matrices influences the efficiency of the developed accelerator, as it does in the remaining works. Sparse matrix parameters that adversely affect the performance of the implemented accelerator include the number of rows, as it increases the size of the output vector that must be written to external memory; the ratio of matrix columns to nonzero elements, as it influences the performance of the Packager unit; and the number of hazards within the matrix, as each hazard requires the PE to stall the FMA core until it is resolved.

All results were compared to a software-only execution on the ARM, optimized to take full advantage of the existing vector floating-point unit (128-bit SIMD) and operating at a frequency of 650 MHz, more than six times the frequency at which the programmable logic operates (100 MHz).
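The near-linear scaling reported in table IV below can be reproduced by simply evaluating the model sketch from the previous section for k = 1 to 7. The driver below is again hypothetical: the matrix parameters are illustrative placeholder values, the 11-stage reduction adder is an assumed depth (not stated in the text), and f = 100 MHz matches the programmable-logic clock mentioned above.

```c
#include <stdio.h>

/* Defined in the analytical-model sketch above. */
double theo_exec_time_us(double nz, double n, double m, int k,
                         double p_hazard, int adder_stages, double f_mhz);

int main(void)
{
    /* Placeholder matrix parameters (illustrative values only). */
    double nz = 72000.0, n = 3432.0, m = 91.0, p_hazard = 0.0;
    for (int k = 1; k <= 7; k++)
        printf("k=%d: estimated execution time = %.0f us\n",
               k, theo_exec_time_us(nz, n, m, k, p_hazard, 11, 100.0));
    return 0;
}
```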
TABLE IV
HARDWARE AND SOFTWARE EXECUTION TIMES, IN MICROSECONDS (US), AND RESPECTIVE SPEEDUP OBTAINED FOR A SYSTEM WITH k SPMV ACCELERATORS USING COLUMN BLOCKS

Accelerators (k)   mk10-b4 (HW / ARM / Speedup)   Maragal 2 (HW / ARM / Speedup)   Flower 5 4 (HW / ARM / Speedup)   bibd 14 7 (HW / ARM / Speedup)
1 105 268 2.54 55 128 2.33 621 1644 2.65 811 2021 2.49
2 61 268 4.42 32 128 3.95 350 1644 4.70 381 2021 5.30
3 44 268 6.13 24 128 5.36 250 1644 6.56 255 2021 7.92
4 36 268 7.49 20 128 6.41 202 1644 8.15 192 2021 10.51
5 31 268 8.61 18 128 7.24 172 1644 9.54 155 2021 13.07
6 28 268 9.59 16 128 7.95 153 1644 10.76 129 2021 15.61
7 26 268 10.43 15 128 8.54 139 1644 11.85 111 2021 18.13

TABLE V
COMPARISON OF FPGA SPMV ARCHITECTURES

                    [12]              [13]               [18]                  [14]                [16]             [15]             This work
FPGA                Virtex-5 LX155T   Virtex-II Pro 70   Stratix-III EP3E260   Virtex-II Pro 100   Virtex-II 6000   Virtex-5 SX95T   Zynq XC7Z020
Frequency [MHz] 100 200 100 170 95 150 100
Memory Bw. [GB/s] 6.5 8 8.5 8.5 1.6 35.74 1.6
Number of PEs 16 4 6 5 3 64 4
Peak Perf. [GFLOPS] 3.2 1.6 1.2 1.7 0.57 19.2 0.8
Matrix Format CVBV CSR COO CSR SPAR CSC RCSC
Sparsity
Min - Max [%] 0.01-5.48 0.04-4.17 0.51-11.49 0.04-0.39 0.01-1.10 0.003-0.33 0.02-23.08
Average [%] 1.41 0.87 3.34 0.16 — 0.09 2.34
Efficiency
Min - Max [%] 1-7 20 - 79 5-7 50 - 98.4 1 - 74 69 - 99.8 54.5 - 98.2
Average [%] 4.4 42.6 5.6 79.4 55.6 91.9 78.5

VI. CONCLUSION

The results show that the architecture proposed in this work is able to achieve an average of 624 MFLOPS (single-precision floating-point). This corresponds to a performance efficiency of 79%. These figures are better than those obtained by the ARM processor, where 142 MFLOPS were measured, corresponding to 22% of peak performance. This translates into a performance improvement of 4.39× on average. Given the predictions of the analytical model, it is expected that the architecture is able to scale with the available bandwidth. Speedup values averaging 10.56× over execution on a general purpose processor are to be expected, as long as enough bandwidth is available, along with the resources to implement the architecture in programmable logic.

REFERENCES

[1] C. Gfrerer (2012). Optimization Methods for Sparse Matrix-Vector Multiplication. In Seminar for Computer Science.
[2] Y. Saad (2003). Iterative Methods for Sparse Linear Systems. SIAM.
[3] R. W. Vuduc and H.-J. Moon (2005). Fast sparse matrix-vector multiplication by exploiting variable block structure. In Proceedings of the High-Performance Computing and Communications Conference (HPCC'2005), pages 807-816.
[4] A. Buttari, V. Eijkhout, J. Langou and S. Filippone (2005). Performance Optimization and Modeling of Blocked Sparse Kernels. Technical Report ICL-UT-04-05.
[5] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick and J. Demmel (2007). Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. In Supercomputing.
[6] J. D. Davis and E. S. Chung (2012). SpMV: A Memory-Bound Application on the GPU Stuck Between a Rock and a Hard Place. Microsoft Technical Report.
[7] E.-J. Im and K. Yelick (1999). Optimizing Sparse Matrix Vector Multiplication on SMPs. In Ninth SIAM Conference on Parallel Processing for Scientific Computing (SIAM'1999). SIAM.
[8] E.-J. Im and K. Yelick (1997). Optimizing sparse matrix computations for register reuse in SPARSITY. In Proceedings of the International Conference on Computational Sciences - Part I, pages 127-136.
[9] A. Pinar and M. T. Heath (1999). Improving performance of sparse matrix-vector multiplication. In Supercomputing.
[10] Wilson M. José, Ana Rita Silva, Mário P. Véstias and Horácio C. Neto (2014). Algorithm-oriented design of efficient many-core architectures applied to dense matrix multiplication. In Analog Integrated Circuits and Signal Processing, pages 9-16. IEEE.
[11] Mário P. Véstias (2014). DReaMaCHine - Design of a Reconfigurable Many-Core Architecture for High Performance Computing. Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento em Lisboa (INESC ID/INESC/IST/ULisboa).
[12] S. Kestur, J. D. Davis and E. S. Chung (2012). Towards a Universal FPGA Matrix-Vector Multiplication Architecture. In International Symposium on Field-Programmable Custom Computing Machines (FCCM'2012), pages 9-16. IEEE.
[13] L. Zhuo and V.K. Prasanna (2005). Sparse Matrix-
Vector multiplication on FPGAs. In Proceedings
of the ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA‘2005), pages
63-74.
[14] Yan Zhang, Y. H. Shalabi, R. Jain, K. K. Nagar
and J. D. Bakos (2009). FPGA vs. GPU for sparse
matrix vector multiply. In International Conference
of Field-Programmable Technology (FPT’2009). pages
255–262.
[15] R. Dorrance, F. Ren and D. Markovic (2014). A Scal-
able Sparse Matrix-Vector Multiplication Kernel for
Energy-Efficient Sparse-Blas on FPGAs. In Proceedings
of the ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA’2014), pages
161-170.
[16] D. Gregg, C. McSweeney, C. McElroy, F. Connor, S.
McGettrick, D. Moloney and D. Geraghty (2007). FPGA
Based Sparse Matrix Vector Multiplication using Com-
modity DRAM Memory. In International Conference
on Field Programmable Logic Applications (FPL’2007),
pages 786-791.
[17] T. Davis. University of Florida Sparse Matrix Collec-
tion. Available: http://www.cise.ufl.edu/research/sparse/
matrices.
[18] S. Sun, M. Monga, P.H. Jones and J. Zambreno (2012).
An I/O Bandwidth-Sensitive Sparse Matrix-Vector Mul-
tiplication Engine on FPGAs. In IEEE Trans. Circuits
Systems - Part I, (ITCS’2012), vol. 59, no. 1, pages
113–123. IEEE.
