Lasa: Abstraction and Specialization for Productive and Performant Linear Algebra on FPGAs
Abstract—Linear algebra can often be significantly expedited by spatial accelerators on FPGAs. As a broadly adopted linear algebra library, BLAS requires extensive optimizations for routines that vary vastly in data reuse, bottleneck resources, matrix storage layouts, and data types. Existing solutions are stuck in the dilemma of productivity and performance.

We introduce Lasa, a framework composed of a programming model and a compiler, that addresses the dilemma by abstracting (for productivity) and specializing (for performance) the architecture of a spatial accelerator. Lasa abstracts a compute and its I/O as two dataflow graphs. A compiler maps the graphs onto systolic arrays and a customized memory hierarchy. The compiler further specializes the architecture transparently. In this framework, we develop 14 key BLAS routines, and demonstrate performance in parity with expert-written HLS code for BLAS level 3 routines, ≥80% of the machine peak performance for level 2 and 1 routines, and 1.6×-7× speed up by taking advantage of the matrix properties of symmetry, triangularity, and bandness.

I. INTRODUCTION

The slowing of Moore's Law has motivated a booming of spatial accelerators, which feature distributed and specialized processing elements (PEs) [1], [2]. Equipped with massive reconfigurable resources and native DSP blocks, FPGAs exhibit competitive performance when realizing accelerators [3], [4].

BLAS [5] is a basic linear algebra library upon which many applications and libraries are built, and thus it is of interest to accelerate. BLAS includes a wide range of vector and matrix operations, called routines, that are classified into three levels: level 3 (matrix-matrix operations) shows massive parallelism and data reuse, and is compute-bound; level 2 (matrix-vector operations) exposes limited data reuse and is memory-bound; level 1 (scalar, vector, and vector-vector operations) exposes no data reuse and is completely memory-bound. A matrix can be general, symmetric, banded, triangular, or Hermitian, given in full, band, or packed storage. Moreover, a routine can work on real or complex data with single or double precision. It is challenging to productively develop so many routines. There have been plenty of BLAS implementations on CPUs [6] and GPUs [7], [8]. However, developing such a library is prohibitively expensive on FPGAs when performance is at stake [9], [10].

The various compute patterns in BLAS pose a huge challenge, and call for a systematic and flexible approach to fully extract performance. It is crucial to utilize compute resources for level 3 routines, yet saturate DRAM bandwidth for level 1 and 2 routines. It is also critical to leverage matrix properties to reduce computations or memory accesses. For example, there are many zeros in triangular and banded matrices, which may enable 2× or more speedups.

Productivity and performance are often considered conflicting goals. High-level synthesis (HLS) produces a design quickly; however, it requires the code to be heavily restructured to get high performance. The state-of-the-art HLS implementations, Vitis BLAS [11] and FBLAS [12], realize only core routines with limited matrix layouts and data types. Recently, domain-specific languages (DSLs) have emerged to relieve this productivity-performance tension. SuSy [13] and HeteroCL [14] separate an algorithm from its customizations; the customizations are only specified, with their implementations left to a compiler. Specification results in succinct code, and compiler optimizations lead to performance. However, SuSy uses low-level primitives that hinder a compiler from optimizing a complex design. HeteroCL relies on fixed code generation templates specific to matrix multiply and convolution. Neither of them takes advantage of the properties of special matrices.

We propose a framework, Lasa, that achieves productivity and performance at the same time by abstracting (for productivity) and specializing (for performance) the architecture of a spatial accelerator. It consists of a programming model (§III) to specify a compute schedule and a memory schedule, and a compiler (§IV) to realize the schedules. The compiler creates systolic arrays per the compute schedule, and an I/O network per the memory schedule. The compiler further analyzes and specializes the specified architecture transparently.

The compute schedule is specified with uniform recurrence equations (UREs) and space-time transforms (STTs). UREs describe a dataflow across compute operations, while an STT determines where (a PE) and when (a cycle) to execute the compute operations. They are a classical and general approach to express systolic, pipelined computes [15], [16], [17], [18]. A systolic array features many PEs that work rhythmically.

The memory schedule defines a dataflow across memories, i.e., where and when to move, buffer, and re-layout data. A memory is abstracted as a customizable storage of a tensor (i.e., a multi-dimensional array), called a streaming tensor (stensor). A stensor can be realized on a type of physical memory (e.g., SRAM), and multiple stensors are connected to build a memory hierarchy.
1  // Subscripts used in UREs below
2  #define P kk, ii, k, i
3  #define P_kk_minus_1 kk-1, ii, k, i
4  #define P_k_minus_1 kk+KK-1, ii, k-1, i
5  #define lin_k kk + KK*k
6  #define lin_i ii + II*i
7  // Inputs from the host with type and number of dimensions
8  ImageParam A(Float(32), 2), x(Float(32), 1), y(Float(32), 1);
9  // Declare UREs with type and dimensions
10 URE UpMV(Float(32), {P}), UpMVOut(Float(32), {ii, i});
11 URE LowMV(Float(32), {P}), LowMVOut(Float(32), {ii, i});
12 URE Add(Float(32), {ii, i});
13 // Declare each stensor with its memory on the device, except z on the host.
14 Stensor DA(DRAM), DX_Up(DRAM), DX_Low(DRAM), DY(DRAM), DZ(DRAM);
15 Stensor SA(SRAM), SX_Up(SRAM), SX_Low(SRAM), z;
16
17 // Define a compute schedule with UREs and STTs.
18 // URE for reducing a lower partial sum:
19 LowMV(P) = select(kk==0 && k==0, 0,
20            select(kk==0, LowMV(P_k_minus_1), LowMV(P_kk_minus_1))
21            ) + A(lin_k, lin_i) * x(lin_k);
22 // URE for a lower partial sum after reduction is done:
23 LowMVOut(ii, i) = select(kk==KK-1 && k==i-1, LowMV(P));
24 // The definition of UpMV/UpMVOut
25 UpMV(P) = select(kk==0 && k==i, 0,
26           select(kk==0, UpMV(P_k_minus_1), UpMV(P_kk_minus_1))
27           ) + A(lin_k, lin_i) * x(lin_k);
28 UpMVOut(ii, i) = select(kk==KK-1 && k==K-1, UpMV(P));
29 // UREs for computing a final result:
30 Add(ii, i) = α*select(i==0, UpMVOut(ii, i),
31              UpMVOut(ii, i) + LowMVOut(ii, i)) + β*y(lin_i);
32 // Put the UREs for computing the lower partial sums together into
33 // a loop nest, and build a PE array with an STT
34 LowMV.reorder(kk, ii, i, k);             // Set the loop order from innermost
35 LowMV.set_ranges(k, 0, K, i, k, I, ...); // Set the loops' ranges
36 LowMV.merge_ures(LowMVOut);              // Merge LowMVOut into LowMV's loop nest
37 LowMV.space_time_transform(kk);          // Apply an STT with space loop kk
38 // Put the UREs for computing the upper partial sums together into
39 // a second loop nest, and build a second PE array with an STT
40 UpMV.set_ranges(i, 0, I, k, i, K, ...);
41 UpMV.merge_ures(UpMVOut);
42 UpMV.space_time_transform(kk);
43 Add.set_ranges(i, 0, I, ii, 0, II);
44
45 // Define a memory schedule with stensors
46 A >> DA.out(kk) >> {UpMV, SA};
47 SA.scope(i).transpose().out(kk) >> LowMV;
48 x >> {DX_Up, DX_Low};
49 DX_Up.out(kk) >> SX_Up.scope(k).out(kk) >> UpMV;
50 DX_Low.out(kk) >> SX_Low.scope(i).out(kk) >> LowMV;
51 y >> DY >> Add;
52 Add >> DZ >> z(lin_i);

Fig. 2. A specification for realizing the SYMV design in Fig. 1. Here select(c, tv, fv) returns tv when c is true, and fv otherwise; select(c, tv) returns tv when c is true and is used only after a reduction is done.

...memory schedule; per the memory schedule, a memory hierarchy is automatically built on the device, and the communication code between the host and the device is generated.

Below we describe how to specify a compute and a memory schedule, with SYMV as the main example.

A. Compute schedule

In a compute schedule, a group of UREs defines a compute, and an STT then maps the compute onto a systolic array. As can be seen from Fig. 2, there can be multiple systolic arrays, expressed by separate groups of UREs and STTs.

A URE is a recursive function with a constant dependence distance across the entire iteration space. In our programming model, a URE is usually defined with a select primitive. For example, Line 19 in Fig. 2 shows a typical reduction pattern: LowMV(P) is initialized with 0 if a reduction starts at the current iteration (kk==0 && k==0); otherwise, it gets the value from the previous iteration (P_kk_minus_1 or P_k_minus_1). Line 23 presents another typical pattern, in which a URE collects the reduced result: once the reduction is done (kk==KK-1 && k==i-1; we discard the result of k==i, which is accounted for in UpMV), LowMV sends the result to LowMVOut. The two UREs are merged (Line 36). In general, F.merge_ures(G, H, ...) puts the UREs in the order F, G, H, ... under the loop nest of F, and F is used to represent all these UREs afterwards.

The space_time_transform primitive maps space loops onto a systolic array. The space loops are unrolled, and every iteration of them becomes a PE, while the other loops (the time loops) are executed on the PEs sequentially. The execution of a PE is subject to its dependences on neighbor PEs.
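To make the mapping concrete, consider the classical affine form of a space-time transform [15] on a toy two-level nest. This is a generic illustration of the technique, not the exact transform the compiler derives for Fig. 2. For loops (i, k) with i chosen as the space loop, one valid transform is

    (p, t)^T = T (i, k)^T,  with T = [1 0; 1 1],  i.e.,  p = i  and  t = i + k.

Iteration (i, k) then executes on PE p = i at cycle t = i + k. A value produced at iteration (i-1, k) is ready at cycle (i-1) + k and is consumed by the neighboring PE i exactly one cycle later, which is the nearest-neighbor, rhythmic execution that characterizes a linear systolic array.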
Dataflow Design Exploration: UREs and STTs can express a complex dataflow. Here we illustrate this point further with GBMV, the band matrix-vector multiplication routine.

Fig. 3. The storage, compute pattern, and a systolic array design for GBMV. (a) Band storage layout and compute pattern. (b) Systolic array dataflow.

As illustrated in Fig. 3a, matrix A in GBMV uses the band storage, which specializes the compute: an inner product runs along a diagonal of the matrix, e.g., a10, a11, a12, and the results can come out from both the top and the right boundary.

In the code snippet below, we specify that URE fX (Line 3) gets a value of x at a boundary (i==0) and forwards it along loop i. The value is used for reducing a partial sum (Line 4). URE MV propagates the partial sum along loops k and i. After the reduction is done, the results are collected from either the top or the right boundary (Lines 5-6). These UREs are merged, and a space-time transform creates a linear systolic array (Line 9). We can see that a partial sum with a10 is computed by PE 2 at time step 0 (t=0), and then it is forwarded from PE 2 to PE 1 and gets updated at time step 1. A partial sum may be forwarded from PE 0 to PE 2 if it crosses the boundary between two tiles (for simplicity, the code snippet below assumes only one tile). Therefore, the partial sums are transferred cyclically between the PEs, as illustrated in Fig. 3b.

1 URE fX(Float(32), {k, i}), MV(Float(32), {k, i});
2 URE TopMVOut(Float(32), {k}), RightMVOut(Float(32), {i});
3 fX(k,i) = select(i==0, x(k), fX(k, i-1));
4 MV(k,i) = select(k==0 || i==I-1, 0, MV(k-1, i+1)) + A(k,i)*fX(k,i);
5 TopMVOut(k) = select(i == 0, MV(k,i));
6 RightMVOut(i) = select(k == K-1, MV(k,i));
7 fX.set_ranges(i, 0, I, k, 0, K);
8 fX.merge_ures(MV, TopMVOut, RightMVOut);
9 fX.space_time_transform(i);

B. Memory schedule

A memory schedule specifies how data are moved through memory levels. A memory is abstracted as a streaming tensor (stensor), capable of receiving, buffering, and sending the data of a tensor. Table I shows the primitives of a memory schedule.
TABLE I
MEMORY SCHEDULE PRIMITIVES

Stensor S[DRAM/SRAM/REG]: Declare a stensor to be realized at a memory level of the device. If no memory level is given, it is in the DRAM of the host by default.
S.scope(i): The capacity of the storage of S is determined by the memory footprint of the tensor when executing one iteration of loop i.
S.transpose(): Transpose the (2-dimensional) storage of S.
S.out(dim0, dim1, ...): S outputs a sub-tensor at once, determined by the sizes of the given dimensions (i.e., loops).
S(index0, index1, ...): Directly access the data in the storage of S with the given indices.
S0 >> S1: Stream tensor data from S0 to S1.
S0 >> {S1, S2}: Shorthand for S0 >> S1 and S0 >> S2.
{S1, S2} >> S0: Shorthand for S1 >> S0 and S2 >> S0.

A stensor has an internal storage allocated in DRAM, SRAM, or REG to buffer the sub-tensor referenced during the execution of its scope loop. In Fig. 2, stensor SA has scope i (Line 47), storing a tile of A under loop i, i.e., a matrix of size [II, KK]. The entire tensor is stored if no scope is given. The compiler implements an SRAM stensor as a double buffer by default, to enable writes and reads in parallel. When data are consumed in the same order as they are produced, the compiler instead builds a FIFO or a single buffer for efficiency.

A spatially connected I/O network is specified by connecting stensors and PE arrays (inputs are treated as stensors for simplicity) with the streaming operator (>>). It describes both the data transfer between the host and a device stensor and a memory hierarchy with stensors at different memory levels on the device. For example, Line 46 in Fig. 2 tells the compiler to offload matrix A from the host DRAM to the device DRAM, and to create a two-level hierarchy using the device DRAM and SRAM. Depending on the data locations, data are transferred over a PCIe bus, a device DRAM bus, or on-chip channels (i.e., FIFOs).

A stensor sends out a sub-tensor with the dimensions given in an out primitive every time. For example, a vector of size KK is sent out from DA each time (.out(kk) in Line 46 of Fig. 2), which tells the compiler to load KK values from all DRAM channels (see §IV-C). SA, an SRAM stensor, is partitioned into locally connected banks (see §IV-C) to output KK values, as specified in Line 47. Besides, a stensor gets its storage layout from the accessing indices of its tensor. For example, SA has the layout of A(lin_k, lin_i) by default (Line 21). We can transpose SA's (2-dimensional) storage with the transpose() primitive, which tells the compiler to build a parallel access buffer (§IV-C).

In summary, the stensor provides a versatile and efficient memory abstraction, whose technical intricacies are handled by the compiler and hidden from programmers.

IV. COMPILER OPTIMIZATIONS

In this section, we first describe the compilation flow and then discuss the major optimizations of compute and memory.

Fig. 4. Breaking the dependence (red line) in an inner product, where lk is the linearized index, i.e., kk+KK*k, and KK is the size of loop kk. (a) Single precision. (b) Double precision.

A. Compilation flow

Our compiler is built on Halide [21], and all optimizations are realized as transformations on its loop-based IR. Given a specification, the compiler builds compute modules (loop nests enclosing UREs) and memory modules (loop nests enclosing memory operands). These modules are connected and specialized with the specified and the transparent optimizations. Finally, we generate OpenCL code for Intel FPGAs.

B. Compute optimizations

Our compiler builds a systolic array from a group of merged UREs and an STT. The space loops are unrolled into PEs, and shift registers are allocated inside the PEs. A reference to a URE value is transparently replaced with a reference to a shift register. For example, LowMV(P_kk_minus_1) and LowMV(P_k_minus_1) in Line 20 of Fig. 2 point to the same register, r[1], which keeps the result from the previous step. Only live values are kept in the registers, by shifting them once every cycle.

Inner product is a common operation in BLAS. Our compiler can recognize this pattern, break up its reduction dependence, and create an adder tree for it. Fig. 4a exemplifies an inner product with reduction loops k and kk; kk is unrolled to generate an adder tree, while k is pipelined. Each iteration of k computes a partial sum cc that is added to c, so there is a self-dependence on c += cc across the outer loop iterations. With single precision, the addition takes one cycle and the self-dependence does not hinder effective pipelining. For double precision, however, the addition requires 2 cycles, and the pipeline would stall until the previous addition completes. Fig. 4b illustrates our solution: the results are kept in different registers (e.g., c0-c3) that are used in rotation, so the dependence spans multiple (4) iterations and pipelining remains effective.
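The effect of this rotation can be mimicked in plain C++ as follows. This is only a minimal software sketch of the idea in Fig. 4b: the real transformation is applied by the compiler on its IR and in the generated OpenCL, and the rotation factor of 4 is taken from the figure rather than being a fixed constant of the compiler.

#include <cstddef>

// Accumulate n partial sums with a rotation over R accumulators, so that
// each accumulator is updated only every R iterations. With a 2-cycle
// double-precision adder, a single accumulator would stall the pipeline;
// rotating lets the loop still issue one iteration per cycle.
double reduce(const double *partial, std::size_t n) {
    constexpr std::size_t R = 4;        // mirrors c0..c3 in Fig. 4b
    double c[R] = {0.0, 0.0, 0.0, 0.0};
    for (std::size_t k = 0; k < n; ++k)
        c[k % R] += partial[k];         // dependence now spans R iterations
    return c[0] + c[1] + c[2] + c[3];   // final combine after the loop
}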
C. Memory optimizations

A memory operand in the UREs is isolated into a chain of memory modules, one module per stensor. Every memory module inherits the loop structure of the URE from which it is isolated, but contains only memory load/store operations. The data are passed through the device memory hierarchy by replacing memory operations with channel operations. For example, the operand x is isolated from UpMV into a chain of 3 modules (x >> DX_Up >> SX_Up, Lines 48-49, Fig. 2) that inherit the same loops, kk, ii, k, i. The compiler then inserts memory/channel operations, e.g., rch(SX_Up) above loop kk in Fig. 1.

The compiler specializes a memory module as a finite state machine that autonomously executes three tasks: it receives data from its producer(s), buffers the data in an internal storage, and reads the data and sends them to its consumer(s).
To reduce redundant DRAM loads, our compiler analyzes a memory operand and removes the reuse loops from a DRAM module. A buffer is inserted under the stensor's scope loop. Typically, data are repeatedly sent from a double buffer when executing the reuse loops. For example, loops ii and i can be removed from DX_Up, and a double buffer is built in SX_Up.

Fig. 5. I/O network design for a 2-D systolic array. (a) An input network. (b) An output network.

The buffer of a stensor is divided into banks if an .out() primitive is given. The banks are distributed across the FPGA plane and connected for efficient data I/O. Fig. 5 illustrates a typical I/O network built for a 2-D systolic array. In Fig. 5a, inputs are propagated to the SRAM banks through a daisy chain, which is a pipeline composed of registers; every bank keeps the data belonging to it from its attached register. In Fig. 5b, each PE drains its outputs to a pipeline of local registers. That pipeline is connected to the pipeline of the next PE along the same column to form a bigger pipeline. The bigger pipelines of all columns shift once, and the outputs at their heads are gathered and sent out.

Our compiler can transparently optimize the data movement between two stensors resident in the host and device DRAM. To overlap the host-to-device data offloading with execution, a pseudo double buffer is created in the device DRAM, which accepts inputs from (or sends outputs to) the host tile by tile, thereby saving space and hiding the transfer time. The data are serialized so that the device reads/writes them sequentially, and are vectorized (e.g., DA.out(kk) in Line 46, Fig. 2) to saturate the DRAM bandwidth. The compiler can automatically interleave data across multiple DRAM channels in the stride of the memory interface width, i.e., the width × the number of channels used equals the vector size (KK). Therefore, a vector is loaded from multiple DRAM channels in parallel.

To realize a .transpose() primitive, the compiler builds a parallel access buffer [22] that allows parallel writes/reads of the data in the same row or the same column by distributing the data into different banks. It is realized as a double buffer, to be written and read at the same time.

Fig. 6. The input data path from a host to LowMV, which passes through two stensors, as specified in Lines 46-47 of Fig. 2.

Fig. 6 illustrates the data movement from the host DRAM to the device DRAM and SRAM. The SRAM stensor is a parallel access buffer that stores a tile of the input matrix. The first, second, third, ... rows in the tile are rotated by 0, 1, 2, ... positions, respectively, and then every element of a row is stored in a separate bank. The data in every row and in every column are hence distributed to different banks and are readable in parallel. For example, the last column (ax3) is placed on a diagonal of the buffer, which can be rotated back and read out.
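The banking rule behind this rotation can be sketched in a few lines. This is an illustration of the parallel access buffer idea from [22], under the assumption of an N × N tile and a left rotation of row r by r positions; the generated banking logic may differ in such details.

#include <cstddef>

constexpr std::size_t N = 4;

// Bank holding tile element (r, c) after rotating row r left by r positions.
constexpr std::size_t bank(std::size_t r, std::size_t c) {
    return (c + N - (r % N)) % N;
}

// Within a bank, an element is addressed by its row.
constexpr std::size_t addr(std::size_t r) { return r; }

// Any row occupies N distinct banks (c varies), and so does any column
// (r varies), so both are readable in one cycle; a column read is rotated
// back to restore element order. The last column lies on a bank diagonal:
static_assert(bank(0, N - 1) == 3 && bank(1, N - 1) == 2 &&
              bank(2, N - 1) == 1 && bank(3, N - 1) == 0,
              "each element of the last column sits in a different bank");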
V. EVALUATION

In this section, we evaluate the performance of 14 key BLAS routines across all 3 BLAS levels, engineered in our approach. We target two generations of Intel FPGAs, Arria 10 GX1150 (A10) and Stratix 10 SX2800 (S10), on Intel DevCloud [23].

A. Level 3 routines

We first discuss SGEMM, single-precision matrix multiplication, which is the most critical BLAS routine and has been extensively studied by previous work.

The results are shown in Table II. We create a 16 × 10 × 8 (vector size × width × height) systolic array on A10, and a 16 × 16 × 14 systolic array on S10. Each PE is assigned 2^10 × 2^10 outputs. Note that the outermost tile size can be passed at runtime, so a design can target various shapes without being re-synthesized. The theoretical machine peak throughput is defined as the number of used DSPs × frequency × 2 (for a multiplication and an addition). Besides throughput, we report efficiency, the measured throughput divided by the peak, to evaluate the quality of a generated design. We achieve nearly 100% efficiency on both FPGAs, 1.1× the throughput of SuSy, and 1.4×/1.8× that of FBLAS. We deliver 96% of the throughput of the expert-written design shipped with the OpenCL SDK [24] on both FPGAs, confirming our high performance.

TABLE II
SGEMM ON FPGAS

             Device  MHz  GOPs  LUTs  DSPs  BRAMs  Efficiency
AutoSA [25]  U250    300   930   52%   68%    -        94%
SuSy [13]    A10     202   547   40%   93%   32%       96%
FBLAS [12]   A10     192   346   40%   71%   80%       83%
             S10     216  1280   35%   57%   66%       91%
Expert [24]  A10     269   646   32%   79%   50%      100%
             S10     261  1871   42%   62%   27%      100%
Lasa         A10     244   620   49%   86%   76%       97%
             S10     251  1790   48%   63%   35%       99%
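As a back-of-the-envelope check of the efficiency metric (using the Lasa/A10 row above and assuming the A10 GX1150's 1518 DSP blocks, a device figure not stated in the table):

    peak ≈ 0.86 × 1518 DSPs × 244 MHz × 2 ≈ 637 GOPs,  and  620 / 637 ≈ 97%,

which matches the reported efficiency.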
We have developed the other level 3 routines (except TRSM), and the results are shown in Table III. We reuse the SGEMM design for SYMM (not shown in this table), because SYMM is compute-bound, and leveraging the symmetry to save memory bandwidth cannot improve its performance. For TRMM, SYRK, and SYR2K, however, half of the computations can be saved. In other words, if we regard a special matrix as a general matrix, the throughput is effectively twice the measured throughput. We introduce speed up, the effective throughput on a general matrix divided by the theoretical machine peak, to evaluate the improvements from exploiting the special properties of matrices.
Particularly, for TRMM / SYRK, we iterate over only the upper triangle of the input/output matrices. For SYR2K, we compute two symmetric points of the result matrix using two separate systolic arrays, and then add them up. We have also developed the routines for Hermitian matrices, HEMM, HERK, and HER2K, with conjugate numbers at symmetric locations. They show efficiencies similar to the single-precision ones.

TABLE III
OTHER LEVEL 3 ROUTINES (SINGLE PRECISION ON A10)

        MHz  GOPs  LUTs  DSPs  BRAMs  Efficiency  Speed up
TRMM    238   471   44%   68%   36%       97%      1.93×
SYRK    259   513   43%   68%   36%       96%      1.93×
SYR2K   253   476   48%   68%   45%       91%      1.81×
HEMM    230   582   41%   86%   74%       97%        -
HERK    228   459   35%   68%   37%       98%      1.96×
HER2K   252   426   42%   68%   42%       82%      1.63×

B. Level 2 routines

We have developed several level 2 routines that cover most matrix types, and the results are shown in Table IV. The peak throughput is defined as the operational intensity × the memory bandwidth, since these routines are memory-bound. A10 (S10) has two (four) DDR channels offering 34 GB/s (76 GB/s) of bandwidth, and every channel is associated with a 64 B interface. To saturate the memory bandwidth, a design's frequency should reach that of an interface; ideally, the frequency should be no less than the DRAM bandwidth / the number of channels / 64 B, i.e., 267 MHz for A10 and 300 MHz for S10. We achieved 95%-105% (82%-114%) of the ideal frequency on A10 (S10).
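Plugging the bandwidth figures above into this rule gives a worked instance (the small deviations from the quoted 267 MHz and 300 MHz come only from rounding the bandwidth to 34 and 76 GB/s):

    f_A10 ≈ 34 GB/s / (2 × 64 B) ≈ 266 MHz,    f_S10 ≈ 76 GB/s / (4 × 64 B) ≈ 297 MHz.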
We create a systolic array with 32 (64) PEs on A10 (S10), and thus the DSP usage is low. Each routine uses a 2^15 × 2^15 input matrix. The throughput of GEMV approaches the peak determined by its operational intensity (1/2) with a sufficiently high frequency. Part of the memory bandwidth can be saved by taking advantage of the special properties in the other matrix-vector multiply routines, SYMV, TRMV, and GBMV; therefore, we still use speed up to indicate the acceleration. The design of SYMV is illustrated in Fig. 1. For TRMV, we can load only half of the input matrix. For GBMV, the speed up depends on the sparsity of the input matrix. We assume 2^12 diagonals in a 2^15 × 2^15 matrix. Every diagonal is stored as a row, and thus we get a 2^12 × 2^15 matrix in band storage, which results in at most 2^15 / 2^12 = 8× speed up. GER takes two input vectors and performs an outer product, which has an operational intensity of only 1/4 and no data reuse. The input vectors have relatively low memory bandwidth usage, because the output is a matrix that dominates the memory accesses.

FBLAS has realized GEMV and GER, and reported > 100 GFlops throughput by generating the inputs on the device. We argue that it is impractical to assume the data are readily available on the device; in most cases, the data need to be loaded from DRAM. A larger PE array cannot improve performance, since these routines are bounded by memory.

TABLE IV
LEVEL 2 ROUTINES (SINGLE PRECISION)

       Device  MHz  GOPs  LUTs  DSPs  BRAMs  Efficiency  Speed up
GEMV   A10     282    16   20%    2%   19%       93%        -
       S10     302    36   27%    1%    8%       94%        -
SYMV   A10     267    15   39%    4%   43%       90%      1.79×
       S10     247    30   56%    2%   50%       79%      1.58×
TRMV   A10     254    15   23%    2%   22%       87%      1.75×
       S10     267    33   31%    1%    9%       86%      1.71×
GBMV   A10     277    16   21%    2%   25%       92%      7.35×
       S10     292    35   27%    1%    9%       91%      7.32×
GER    A10     259   7.6   20%    1%   21%       89%        -
       S10     343    15   27%    1%    8%       78%        -

TABLE V
A LEVEL 1 ROUTINE (ON A10)

     Data Type  MHz  GOPs  LUTs  DSPs  BRAMs  Efficiency
DOT  S          308     8   17%    1%   15%       93%
     D          283     4   27%    2%   19%       96%
     C          323    16   17%    2%   15%       94%
     Z          248   7.5   37%    4%   23%       88%

C. Level 1 routines

Level 1 routines are mostly auxiliary operations, like copying a vector. We report the results of DOT in Table V, with all four data types: single precision real (S), double precision real (D), single precision complex (C), and double precision complex (Z).

DOT is also memory-bound, and we achieved nearly peak efficiency (defined the same way as for the level 2 routines). The operational intensity differs with the data type. One multiplication of two complex numbers needs 4 multiplications and 2 additions of real numbers, and one addition of two complex numbers needs 2 additions of real numbers. Therefore, the numbers of operations per multiply-add for the S:D:C:Z types are 2:2:8:8, and the bytes of an S:D:C:Z number are 4:8:8:16. Thus the operational intensities of the S:D:C:Z data types follow the ratio 2/4 : 2/8 : 8/8 : 8/16 = 2:1:4:2. The throughputs achieved for the four data types roughly follow this ratio.

VI. CONCLUSION

We proposed Lasa, a programming framework for productively implementing high-performance linear algebra routines. Our programming model is succinct yet expressive, capable of describing various compute patterns and memory systems. Our compiler performs extensive optimizations on both compute and I/O. Using this framework, we have developed key BLAS routines and reported impressive performance on routines across all three levels.

ACKNOWLEDGMENT

This work is supported in part by the National Natural Science Foundation of China (NSFC) under grants No. U21B2017 and 62272434. We appreciate the support of Christopher J. Hughes, Pradeep Dubey, Piotr Ratuszniak, Geoff Lowney, and John C. Kreatsoulas from Intel.
REFERENCES

[1] Q. Xiao, S. Zheng, B. Wu, P. Xu, X. Qian, and Y. Liang, "HASCO: Towards agile hardware and software co-design for tensor computation," in Proceedings of the 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 1055–1068.
[2] S. Zheng, R. Chen, A. Wei, Y. Jin, Q. Han, L. Lu, B. Wu, X. Li, S. Yan, and Y. Liang, "AMOS: Enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction," in Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA), 2022, pp. 874–887.
[3] Q. Xiao and Y. Liang, "Towards agile DNN accelerator design using incremental synthesis on FPGAs," in Proceedings of the 30th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2022, pp. 42–48.
[4] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong, "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs," in Proceedings of the 54th Annual Design Automation Conference (DAC), 2017, pp. 1–6.
[5] S. Hammarling, J. Dongarra, J. Du Croz, and R. Hanson, “An extended
set of fortran basic linear algebra subprograms,” ACM Transactions on
Mathematical Software, vol. 14, no. 1, pp. 1–32, 1988.
[6] Intel, “oneAPI Math Kernel Library,” 2022, https://www.intel.com/
content/www/us/en/developer/tools/oneapi/onemkl.html.
[7] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou,
H. Ltaief, P. Luszczek, and S. Tomov, “Numerical linear algebra on
emerging architectures: The plasma and magma projects,” in Journal of
Physics: Conference Series, vol. 180, no. 1, 2009, pp. 12–37.
[8] Nvidia, “Basic Linear Algebra on GPUs,” 2022, https://developer.nvidia.
com/cublas.
[9] Y.-H. Lai, E. Ustun, S. Xiang, Z. Fang, H. Rong, and Z. Zhang, “Pro-
gramming and synthesis for software-defined fpga acceleration: status
and future prospects,” ACM Transactions on Reconfigurable Technology
and Systems (TRETS), vol. 14, no. 4, pp. 1–39, 2021.
[10] Q. Xiao, L. Lu, J. Xie, and Y. Liang, “FCNNLib: An efficient and
flexible convolution algorithm library on fpgas,” in Proceedings of the
57th Annual Design Automation Conference (DAC), 2020, pp. 1–6.
[11] Xilinx, “Vitis BLAS Library,” 2022, https://github.com/Xilinx/Vitis
Libraries/tree/master/blas.
[12] T. De Matteis, J. de Fine Licht, and T. Hoefler, “FBLAS: Streaming
linear algebra on fpga,” in Proceedings of the International Conference
for High Performance Computing, Networking, Storage and Analysis
(SC), 2020, pp. 1–13.
[13] Y.-H. Lai, H. Rong, S. Zheng, W. Zhang, X. Cui, Y. Jia, J. Wang,
B. Sullivan, Z. Zhang, Y. Liang et al., “Susy: A programming model for
productive construction of high-performance systolic arrays on fpgas,”
in Proceedings of the 39th International Conference on Computer-Aided
Design (ICCAD), 2020, pp. 1–9.
[14] Y.-H. Lai, Y. Chi, Y. Hu, J. Wang, C. H. Yu, Y. Zhou, J. Cong, and
Z. Zhang, “HeteroCL: A multi-paradigm programming infrastructure for
software-defined reconfigurable computing,” in Proceedings of the 27th
ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (FPGA), 2019, pp. 242–251.
[15] P. Quinton, “Automatic synthesis of systolic arrays from uniform recur-
rent equations,” ACM SIGARCH Computer architecture news, vol. 12,
no. 3, pp. 208–214, 1984.
[16] L. Lu, N. Guan, Y. Wang, L. Jia, Z. Luo, J. Yin, J. Cong, and Y. Liang,
“TENET: A framework for modeling tensor dataflow based on relation-
centric notation,” in Proceedings of the 48th Annual International
Symposium on Computer Architecture (ISCA), 2021, pp. 720–733.
[17] L. Jia, Z. Luo, L. Lu, and Y. Liang, “Tensorlib: A spatial accelerator
generation framework for tensor algebra,” in Proceedings of the 58th
Annual Design Automation Conference (DAC), 2021, pp. 865–870.
[18] L. Jia, Y. Wang, J. Leng, and Y. Liang, “EMS: efficient memory
subsystem synthesis for spatial accelerators,” in Proceedings of the 59th
Annual Design Automation Conference (DAC), 2022, pp. 67–72.
[19] S. Xiang, Y.-H. Lai, Y. Zhou, H. Chen, N. Zhang, D. Pal, and Z. Zhang,
“HeteroFlow: An accelerator programming model with decoupled data
placement for software-defined FPGAs,” in Proceedings of the 30th
ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (FPGA), 2022, pp. 78–88.
[20] N. Srivastava, H. Rong, P. Barua, G. Feng, H. Cao, Z. Zhang, D. Albonesi, V. Sarkar, W. Chen, P. Petersen et al., "T2S-Tensor: Productively generating high-performance spatial hardware for dense tensor computations," in Proceedings of the 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2019, pp. 181–189.
[21] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," ACM SIGPLAN Notices, vol. 48, no. 6, pp. 519–530, 2013.
[22] B. Hanounik, "Diagonal registers: Novel vector register file design for high performance and multimedia computing," Master's thesis, Citeseer, 2000.
[23] Intel, "DevCloud," 2022, https://devcloud.intel.com.
[24] Intel, "Intel FPGA SDK for OpenCL Software Technology," 2022, https://www.intel.com/content/www/us/en/software/programmable/sdk-for-opencl/overview.html.
[25] J. Wang, L. Guo, and J. Cong, "AutoSA: A polyhedral compiler for high-performance systolic arrays on FPGA," in Proceedings of the 29th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2021, pp. 93–104.