Lasa: Abstraction and Specialization for Productive and Performant Linear Algebra on FPGAs
Abstract—Linear algebra can often be significantly expedited by spatial accelerators on FPGAs. As a broadly adopted linear algebra library, BLAS requires extensive optimizations for routines that vary vastly in data reuse, bottleneck resources, matrix storage layouts, and data types. Existing solutions are stuck in the dilemma of productivity and performance.

We introduce Lasa, a framework composed of a programming model and a compiler, that addresses the dilemma by abstracting (for productivity) and specializing (for performance) the architecture of a spatial accelerator. Lasa abstracts a compute and its I/O as two dataflow graphs. A compiler maps the graphs onto systolic arrays and a customized memory hierarchy. The compiler further specializes the architecture transparently. In this framework, we develop 14 key BLAS routines, and demonstrate performance in parity with expert-written HLS code for BLAS level 3 routines, ≥80% of the machine peak performance for level 2 and 1 routines, and 1.6×-7× speed up by taking advantage of the matrix properties of symmetry, triangularity, and bandness.

I. INTRODUCTION

The slowing of Moore's Law has motivated a booming of spatial accelerators, which feature distributed and specialized processing elements (PEs) [1], [2]. Equipped with massive reconfigurable resources and native DSP blocks, FPGAs exhibit competitive performance when realizing accelerators [3], [4].

BLAS [5] is a basic linear algebra library upon which many applications and libraries are built, and thus it is of interest to accelerate. BLAS includes a wide range of vector and matrix operations, called routines, that are classified into three levels: level 3 (matrix-matrix operations) shows massive parallelism and data reuse, and is compute-bound; level 2 (matrix-vector operations) exposes limited data reuse and is memory-bound; level 1 (scalar, vector, and vector-vector operations) exposes no data reuse and is completely memory-bound. A matrix can be general, symmetric, banded, triangular, or Hermitian, given in full, band, or packed storage. Moreover, a routine can work on real or complex data with single or double precision. It is challenging to productively develop so many routines. There have been plenty of BLAS implementations on CPUs [6] and GPUs [7], [8]. However, developing such a library is prohibitively expensive on FPGAs when performance is at stake [9], [10].

The various compute patterns in BLAS pose a huge challenge, and call for a systematic and flexible approach to fully extract performance. It is crucial to utilize compute resources for level 3 routines, yet saturate DRAM bandwidth for level 1 and 2 routines. It is also critical to leverage matrix properties to reduce computations or memory accesses. For example, there are many zeros in triangular and banded matrices, which may enable 2× or more speedups.

Productivity and performance are often considered conflicting goals. High-level synthesis (HLS) produces a design quickly; however, it requires the code to be heavily restructured to get high performance. The state-of-the-art HLS implementations, Vitis BLAS [11] and FBLAS [12], realize only core routines with limited matrix layouts and data types. Recently, domain-specific languages (DSLs) have emerged to relieve this productivity-performance tension. SuSy [13] and HeteroCL [14] separate an algorithm from its customizations; the customizations are only specified, with their implementations left to a compiler. Specification results in succinct code, and compiler optimizations lead to performance. However, SuSy uses low-level primitives that hinder a compiler from optimizing a complex design. HeteroCL relies on fixed code generation templates specific to matrix multiply and convolution. Neither of them takes advantage of the properties of special matrices.

We propose a framework, Lasa, that achieves productivity and performance at the same time by abstracting (for productivity) and specializing (for performance) the architecture of a spatial accelerator. It consists of a programming model (§III) to specify a compute schedule and a memory schedule, and a compiler (§IV) to realize the schedules. The compiler creates systolic arrays per the compute schedule, and an I/O network per the memory schedule. The compiler further analyzes and specializes the specified architecture transparently.

The compute schedule is specified with uniform recurrence equations (UREs) and space-time transforms (STTs). UREs describe a dataflow across compute operations, while an STT determines where (a PE) and when (a cycle) to execute the compute operations. They are a classical and general approach to express systolic, pipelined computes [15], [16], [17], [18]. A systolic array features many PEs that work rhythmically.

The memory schedule defines a dataflow across memories, i.e., where and when to move, buffer, and re-layout data. A memory is abstracted as a customizable storage of a tensor (i.e., a multi-dimensional array), called a streaming tensor (stensor). A stensor can be realized on a type of physical memory (e.g., SRAM), and multiple stensors are connected to build a memory hierarchy.
1  // Subscripts used in UREs below
2  #define P kk, ii, k, i
3  #define P_kk_minus_1 kk-1, ii, k, i
4  #define P_k_minus_1 kk+KK-1, ii, k-1, i
5  #define lin_k kk + KK*k
6  #define lin_i ii + II*i
7  // Inputs from the host with type and number of dimensions
8  ImageParam A(Float(32), 2), x(Float(32), 1), y(Float(32), 1);
9  // Declare UREs with type and dimensions
10 URE UpMV(Float(32), {P}), UpMVOut(Float(32), {ii, i});
11 URE LowMV(Float(32), {P}), LowMVOut(Float(32), {ii, i});
12 URE Add(Float(32), {ii, i});
13 // Declare each stensor with its memory on the device, except z on the host.
14 Stensor DA(DRAM), DX_Up(DRAM), DX_Low(DRAM), DY(DRAM), DZ(DRAM);
15 Stensor SA(SRAM), SX_Up(SRAM), SX_Low(SRAM), z;
16
17 // Define a compute schedule with UREs and STTs.
18 // URE for reducing a lower partial sum:
19 LowMV(P) = select(kk==0 && k==0, 0,
20            select(kk==0, LowMV(P_k_minus_1), LowMV(P_kk_minus_1))
21            ) + A(lin_k, lin_i) * x(lin_k);
22 // URE for a lower partial sum after reduction is done:
23 LowMVOut(ii, i) = select(kk==KK-1 && k==i-1, LowMV(P));
24 // The definition of UpMV/UpMVOut
25 UpMV(P) = select(kk==0 && k==i, 0,
26           select(kk==0, UpMV(P_k_minus_1), UpMV(P_kk_minus_1))
27           ) + A(lin_k, lin_i) * x(lin_k);
28 UpMVOut(ii, i) = select(kk==KK-1 && k==K-1, UpMV(P));
29 // UREs for computing a final result:
30 Add(ii, i) = α*select(i==0, UpMVOut(ii, i),
31              UpMVOut(ii, i) + LowMVOut(ii, i)) + β*y(lin_i);
32 // Put the UREs for computing the lower partial sums together into
33 // a loop nest, and build a PE array with an STT
34 LowMV.reorder(kk, ii, i, k);             // Set the loop order from innermost
35 LowMV.set_ranges(k, 0, K, i, k, I, ...); // Set the loops' ranges
36 LowMV.merge_ures(LowMVOut);              // Merge LowMVOut into LowMV's loop nest
37 LowMV.space_time_transform(kk);          // Apply an STT with space loop kk
38 // Put the UREs for computing the upper partial sums together into
39 // a second loop nest, and build a second PE array with an STT
40 UpMV.set_ranges(i, 0, I, k, i, K, ...);
41 UpMV.merge_ures(UpMVOut);
42 UpMV.space_time_transform(kk);
43 Add.set_ranges(i, 0, I, ii, 0, II);
44
45 // Define a memory schedule with stensors
46 A >> DA.out(kk) >> {UpMV, SA};
47 SA.scope(i).transpose().out(kk) >> LowMV;
48 x >> {DX_Up, DX_Low};
49 DX_Up.out(kk) >> SX_Up.scope(k).out(kk) >> UpMV;
50 DX_Low.out(kk) >> SX_Low.scope(i).out(kk) >> LowMV;
51 y >> DY >> Add;
52 Add >> DZ >> z(lin_i);

Fig. 2. A specification for realizing the SYMV design in Fig. 1. Here select(c, tv, fv) returns tv when c is true, and fv otherwise; select(c, tv) returns tv when c is true and is used only after a reduction is done.

...memory schedule; per the memory schedule, a memory hierarchy is automatically built on the device, and the communication code between the host and the device is generated.

Below we describe how to specify a compute and a memory schedule, with SYMV as the main example.

A. Compute schedule

In a compute schedule, a group of UREs defines a compute, and an STT then maps the compute onto a systolic array. As can be seen from Fig. 2, there can be multiple systolic arrays, expressed by separate groups of UREs and STTs.

A URE is a recursive function with a constant dependence distance across the entire iteration space. In our programming model, a URE is usually defined with a select primitive. For example, Line 19 in Fig. 2 shows a typical reduction pattern: LowMV(P) is initialized with 0 if a reduction starts at the current iteration (kk==0 && k==0); otherwise, it gets the value from the previous iteration (P_kk_minus_1 or P_k_minus_1). Line 23 presents another typical pattern, in which a URE collects the reduced result: once the reduction is done (kk==KK-1 && k==i-1; we discard the result of k==i, which is accounted for in UpMV), LowMV sends the result to LowMVOut. The two UREs are merged (Line 36). In general, F.merge_ures(G, H, ...) puts the UREs in the order F, G, H, ... under the loop nest of F, and F is used to represent all these UREs afterwards.

The space_time_transform primitive maps space loops onto a systolic array. The space loops are unrolled, and every iteration of them becomes a PE, while the other loops (the time loops) are executed on the PEs sequentially. The execution of a PE is subject to its dependences on neighbor PEs.
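To make the mapping concrete, consider the classical affine form of a space-time transform [15] on a toy two-level nest. This is a generic illustration of the technique, not the exact transform the compiler derives for Fig. 2. For loops (i, k) with i chosen as the space loop, one valid transform is

    (p, t)^T = T (i, k)^T,  with T = [1 0; 1 1],  i.e.,  p = i  and  t = i + k.

Iteration (i, k) then executes on PE p = i at cycle t = i + k. A value produced at iteration (i-1, k) is ready at cycle (i-1) + k and is consumed by the neighboring PE i exactly one cycle later, which is the nearest-neighbor, rhythmic execution that characterizes a linear systolic array.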
Dataflow Design Exploration: UREs and STTs can express a complex dataflow. Here we illustrate this point further with GBMV, the band matrix-vector multiplication routine.

Fig. 3. The storage, compute pattern, and a systolic array design for GBMV. (a) Band storage layout and compute pattern. (b) Systolic array dataflow.

As illustrated in Fig. 3a, matrix A in GBMV uses the band storage, which specializes the compute: an inner product runs along a diagonal of the matrix, e.g., a10, a11, a12, and the results can come out from both the top and the right boundary.

In the code snippet below, we specify that URE fX (Line 3) gets a value of x at a boundary (i==0) and forwards it along loop i. The value is used for reducing a partial sum (Line 4). URE MV propagates the partial sum along loops k and i. After the reduction is done, the results are collected from either the top or the right boundary (Lines 5-6). These UREs are merged, and a space-time transform creates a linear systolic array (Line 9). We can see that a partial sum with a10 is computed by PE 2 at time step 0 (t=0), and then it is forwarded from PE 2 to PE 1 and gets updated at time step 1. A partial sum may be forwarded from PE 0 to PE 2 if it crosses the boundary between two tiles (for simplicity, the code snippet below assumes only one tile). Therefore, the partial sums are transferred cyclically between the PEs, as illustrated in Fig. 3b.

1 URE fX(Float(32), {k, i}), MV(Float(32), {k, i});
2 URE TopMVOut(Float(32), {k}), RightMVOut(Float(32), {i});
3 fX(k,i) = select(i==0, x(k), fX(k, i-1));
4 MV(k,i) = select(k==0 || i==I-1, 0, MV(k-1, i+1)) + A(k,i)*fX(k,i);
5 TopMVOut(k) = select(i == 0, MV(k,i));
6 RightMVOut(i) = select(k == K-1, MV(k,i));
7 fX.set_ranges(i, 0, I, k, 0, K);
8 fX.merge_ures(MV, TopMVOut, RightMVOut);
9 fX.space_time_transform(i);

B. Memory schedule

A memory schedule specifies how data are moved through memory levels. A memory is abstracted as a streaming tensor (stensor), capable of receiving, buffering, and sending the data of a tensor. Table I shows the primitives of a memory schedule.
TABLE I
MEMORY SCHEDULE PRIMITIVES

Stensor S[DRAM/SRAM/REG]: Declare a stensor to be realized at a memory level of the device. If no memory level is given, it is in the DRAM of the host by default.
S.scope(i): The capacity of the storage of S is determined by the memory footprint of the tensor when executing one iteration of loop i.
S.transpose(): Transpose the (2-dimensional) storage of S.
S.out(dim0, dim1, ...): S outputs a sub-tensor at once, determined by the sizes of the given dimensions (i.e., loops).
S(index0, index1, ...): Directly access the data in the storage of S with the given indices.
S0 >> S1: Stream tensor data from S0 to S1.
S0 >> {S1, S2}: Shorthand for S0 >> S1 and S0 >> S2.
{S1, S2} >> S0: Shorthand for S1 >> S0 and S2 >> S0.

A stensor has an internal storage allocated in DRAM, SRAM, or REG to buffer the sub-tensor referenced during the execution of its scope loop. In Fig. 2, stensor SA has scope i (Line 47), storing a tile of A under loop i, i.e., a matrix of size [II, KK]. The entire tensor is stored if no scope is given. The compiler implements an SRAM stensor as a double buffer by default, to enable writes and reads in parallel. When data are consumed in the same order as they are produced, the compiler instead builds a FIFO or a single buffer for efficiency.

A spatially connected I/O network is specified by connecting stensors and PE arrays (inputs are treated as stensors for simplicity) with the streaming operator (>>). It describes both the data transfer between the host and a device stensor and a memory hierarchy with stensors at different memory levels on the device. For example, Line 46 in Fig. 2 tells the compiler to offload matrix A from the host DRAM to the device DRAM, and to create a two-level hierarchy using the device DRAM and SRAM. Depending on the data locations, data are transferred over a PCIe bus, a device DRAM bus, or on-chip channels (i.e., FIFOs).

A stensor sends out a sub-tensor with the dimensions given in an out primitive every time. For example, a vector of size KK is sent out from DA each time (.out(kk) in Line 46 of Fig. 2), which tells the compiler to load KK values from all DRAM channels (see §IV-C). SA, an SRAM stensor, is partitioned into locally connected banks (see §IV-C) to output KK values, as specified in Line 47. Besides, a stensor gets its storage layout from the accessing indices of its tensor. For example, SA has the layout of A(lin_k, lin_i) by default (Line 21). We can transpose SA's (2-dimensional) storage with the transpose() primitive, which tells the compiler to build a parallel access buffer (§IV-C).

In summary, the stensor provides a versatile and efficient memory abstraction, whose technical intricacies are handled by the compiler and hidden from programmers.

IV. COMPILER OPTIMIZATIONS

In this section, we first describe the compilation flow and then discuss the major optimizations of compute and memory.

Fig. 4. Breaking the dependence (red line) in an inner product, where lk is the linearized index, i.e., kk+KK*k, and KK is the size of loop kk. (a) Single precision. (b) Double precision.

A. Compilation flow

Our compiler is built on Halide [21], and all optimizations are realized as transformations on its loop-based IR. Given a specification, the compiler builds compute modules (loop nests enclosing UREs) and memory modules (loop nests enclosing memory operands). These modules are connected and specialized with the specified and the transparent optimizations. Finally, we generate OpenCL code for Intel FPGAs.

B. Compute optimizations

Our compiler builds a systolic array from a group of merged UREs and an STT. The space loops are unrolled into PEs, and shift registers are allocated inside the PEs. A reference to a URE value is transparently replaced with a reference to a shift register. For example, LowMV(P_kk_minus_1) and LowMV(P_k_minus_1) in Line 20 of Fig. 2 point to the same register, r[1], which keeps the result from the previous step. Only live values are kept in the registers, by shifting them once every cycle.

Inner product is a common operation in BLAS. Our compiler can recognize this pattern, break up its reduction dependence, and create an adder tree for it. Fig. 4a exemplifies an inner product with reduction loops k and kk; kk is unrolled to generate an adder tree, while k is pipelined. Each iteration of k computes a partial sum cc that is added to c, so there is a self-dependence on c += cc across the outer loop iterations. With single precision, the addition takes one cycle and the self-dependence does not hinder effective pipelining. For double precision, however, the addition requires 2 cycles, and the pipeline would stall until the previous addition completes. Fig. 4b illustrates our solution: the results are kept in different registers (e.g., c0-c3) that are used in rotation, so the dependence spans multiple (4) iterations and pipelining remains effective.
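The effect of this rotation can be mimicked in plain C++ as follows. This is only a minimal software sketch of the idea in Fig. 4b: the real transformation is applied by the compiler on its IR and in the generated OpenCL, and the rotation factor of 4 is taken from the figure rather than being a fixed constant of the compiler.

#include <cstddef>

// Accumulate n partial sums with a rotation over R accumulators, so that
// each accumulator is updated only every R iterations. With a 2-cycle
// double-precision adder, a single accumulator would stall the pipeline;
// rotating lets the loop still issue one iteration per cycle.
double reduce(const double *partial, std::size_t n) {
    constexpr std::size_t R = 4;        // mirrors c0..c3 in Fig. 4b
    double c[R] = {0.0, 0.0, 0.0, 0.0};
    for (std::size_t k = 0; k < n; ++k)
        c[k % R] += partial[k];         // dependence now spans R iterations
    return c[0] + c[1] + c[2] + c[3];   // final combine after the loop
}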
C. Memory optimizations

A memory operand in the UREs is isolated into a chain of memory modules, one module per stensor. Every memory module inherits the loop structure of the URE from which it is isolated, but contains only memory load/store operations. The data are passed through the device memory hierarchy by replacing memory operations with channel operations. For example, the operand x is isolated from UpMV into a chain of 3 modules (x >> DX_Up >> SX_Up, Lines 48-49, Fig. 2) that inherit the same loops, kk, ii, k, i. The compiler then inserts memory/channel operations, e.g., rch(SX_Up) above loop kk in Fig. 1.

The compiler specializes a memory module as a finite state machine that autonomously executes three tasks: it receives data from its producer(s), buffers the data in an internal storage, and reads the data and sends them to its consumer(s).
To reduce redundant DRAM loads, our compiler analyzes a memory operand and removes the reuse loops from a DRAM module. A buffer is inserted under the stensor's scope loop. Typically, data are repeatedly sent from a double buffer when executing the reuse loops. For example, loops ii and i can be removed from DX_Up, and a double buffer is built in SX_Up.

Fig. 5. I/O network design for a 2-D systolic array. (a) An input network. (b) An output network.

The buffer of a stensor is divided into banks if an .out() primitive is given. The banks are distributed across the FPGA plane and connected for efficient data I/O. Fig. 5 illustrates a typical I/O network built for a 2-D systolic array. In Fig. 5a, inputs are propagated to the SRAM banks through a daisy chain, which is a pipeline composed of registers; every bank keeps the data belonging to it from its attached register. In Fig. 5b, each PE drains its outputs to a pipeline of local registers. That pipeline is connected to the pipeline of the next PE along the same column to form a bigger pipeline. The bigger pipelines of all columns shift once, and the outputs at their heads are gathered and sent out.

Our compiler can transparently optimize the data movement between two stensors resident in the host and device DRAM. To overlap the host-to-device data offloading with execution, a pseudo double buffer is created in the device DRAM, which accepts inputs from (or sends outputs to) the host tile by tile, thereby saving space and hiding the transfer time. The data are serialized so that the device reads/writes them sequentially, and are vectorized (e.g., DA.out(kk) in Line 46, Fig. 2) to saturate the DRAM bandwidth. The compiler can automatically interleave data across multiple DRAM channels in the stride of the memory interface width, i.e., the width × the number of channels used equals the vector size (KK). Therefore, a vector is loaded from multiple DRAM channels in parallel.

To realize a .transpose() primitive, the compiler builds a parallel access buffer [22] that allows parallel writes/reads of the data in the same row or the same column by distributing the data into different banks. It is realized as a double buffer, to be written and read at the same time.

Fig. 6. The input data path from a host to LowMV, which passes through two stensors, as specified in Lines 46-47 of Fig. 2.

Fig. 6 illustrates the data movement from the host DRAM to the device DRAM and SRAM. The SRAM stensor is a parallel access buffer that stores a tile of the input matrix. The first, second, third, ... rows in the tile are rotated by 0, 1, 2, ... positions, respectively, and then every element of a row is stored in a separate bank. The data in every row and in every column are hence distributed to different banks and are readable in parallel. For example, the last column (ax3) is placed on a diagonal of the buffer, which can be rotated back and read out.
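The banking rule behind this rotation can be sketched in a few lines. This is an illustration of the parallel access buffer idea from [22], under the assumption of an N × N tile and a left rotation of row r by r positions; the generated banking logic may differ in such details.

#include <cstddef>

constexpr std::size_t N = 4;

// Bank holding tile element (r, c) after rotating row r left by r positions.
constexpr std::size_t bank(std::size_t r, std::size_t c) {
    return (c + N - (r % N)) % N;
}

// Within a bank, an element is addressed by its row.
constexpr std::size_t addr(std::size_t r) { return r; }

// Any row occupies N distinct banks (c varies), and so does any column
// (r varies), so both are readable in one cycle; a column read is rotated
// back to restore element order. The last column lies on a bank diagonal:
static_assert(bank(0, N - 1) == 3 && bank(1, N - 1) == 2 &&
              bank(2, N - 1) == 1 && bank(3, N - 1) == 0,
              "each element of the last column sits in a different bank");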
V. EVALUATION

In this section, we evaluate the performance of 14 key BLAS routines across all 3 BLAS levels, engineered in our approach. We target two generations of Intel FPGAs, Arria 10 GX1150 (A10) and Stratix 10 SX2800 (S10), on Intel DevCloud [23].

A. Level 3 routines

We first discuss SGEMM, single-precision matrix multiplication, which is the most critical BLAS routine and has been extensively studied by previous work.

The results are shown in Table II. We create a 16 × 10 × 8 (vector size × width × height) systolic array on A10, and a 16 × 16 × 14 systolic array on S10. Each PE is assigned 2^10 × 2^10 outputs. Note that the outermost tile size can be passed at runtime, so a design can target various shapes without being re-synthesized. The theoretical machine peak throughput is defined as the number of used DSPs × frequency × 2 (for a multiplication and an addition). Besides throughput, we report efficiency, the measured throughput divided by the peak, to evaluate the quality of a generated design. We achieve nearly 100% efficiency on both FPGAs, 1.1× the throughput of SuSy, and 1.4×/1.8× that of FBLAS. We deliver 96% of the throughput of the expert-written design shipped with the OpenCL SDK [24] on both FPGAs, confirming our high performance.

TABLE II
SGEMM ON FPGAS

             Device  MHz  GOPs  LUTs  DSPs  BRAMs  Efficiency
AutoSA [25]  U250    300   930   52%   68%    -        94%
SuSy [13]    A10     202   547   40%   93%   32%       96%
FBLAS [12]   A10     192   346   40%   71%   80%       83%
             S10     216  1280   35%   57%   66%       91%
Expert [24]  A10     269   646   32%   79%   50%      100%
             S10     261  1871   42%   62%   27%      100%
Lasa         A10     244   620   49%   86%   76%       97%
             S10     251  1790   48%   63%   35%       99%
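As a back-of-the-envelope check of the efficiency metric (using the Lasa/A10 row above and assuming the A10 GX1150's 1518 DSP blocks, a device figure not stated in the table):

    peak ≈ 0.86 × 1518 DSPs × 244 MHz × 2 ≈ 637 GOPs,  and  620 / 637 ≈ 97%,

which matches the reported efficiency.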
We have developed the other level 3 routines (except TRSM), and the results are shown in Table III. We reuse the SGEMM design for SYMM (not shown in this table), because SYMM is compute-bound, and leveraging the symmetry to save memory bandwidth cannot improve its performance. For TRMM, SYRK, and SYR2K, however, half of the computations can be saved. In other words, if we regard a special matrix as a general matrix, the throughput is effectively twice the measured throughput. We introduce speed up, the effective throughput on a general matrix divided by the theoretical machine peak, to evaluate the improvements from exploiting the special properties of matrices.
Particularly, for TRMM / SYRK, we iterate over only the upper triangle of the input/output matrices. For SYR2K, we compute two symmetric points of the result matrix using two separate systolic arrays, and then add them up. We have also developed the routines for Hermitian matrices, HEMM, HERK, and HER2K, with conjugate numbers at symmetric locations. They show efficiencies similar to the single-precision ones.

TABLE III
OTHER LEVEL 3 ROUTINES (SINGLE PRECISION ON A10)

        MHz  GOPs  LUTs  DSPs  BRAMs  Efficiency  Speed up
TRMM    238   471   44%   68%   36%       97%      1.93×
SYRK    259   513   43%   68%   36%       96%      1.93×
SYR2K   253   476   48%   68%   45%       91%      1.81×
HEMM    230   582   41%   86%   74%       97%        -
HERK    228   459   35%   68%   37%       98%      1.96×
HER2K   252   426   42%   68%   42%       82%      1.63×

B. Level 2 routines

We have developed several level 2 routines that cover most matrix types, and the results are shown in Table IV. The peak throughput is defined as the operational intensity × the memory bandwidth, since these routines are memory-bound. A10 (S10) has two (four) DDR channels offering 34 GB/s (76 GB/s) of bandwidth, and every channel is associated with a 64 B interface. To saturate the memory bandwidth, a design's frequency should reach that of an interface; ideally, the frequency should be no less than the DRAM bandwidth / the number of channels / 64 B, i.e., 267 MHz for A10 and 300 MHz for S10. We achieved 95%-105% (82%-114%) of the ideal frequency on A10 (S10).
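Plugging the bandwidth figures above into this rule gives a worked instance (the small deviations from the quoted 267 MHz and 300 MHz come only from rounding the bandwidth to 34 and 76 GB/s):

    f_A10 ≈ 34 GB/s / (2 × 64 B) ≈ 266 MHz,    f_S10 ≈ 76 GB/s / (4 × 64 B) ≈ 297 MHz.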
We create a systolic array with 32 (64) PEs on A10 (S10), and thus the DSP usage is low. Each routine uses a 2^15 × 2^15 input matrix. The throughput of GEMV approaches the peak determined by its operational intensity (1/2) with a sufficiently high frequency. Part of the memory bandwidth can be saved by taking advantage of the special properties in the other matrix-vector multiply routines, SYMV, TRMV, and GBMV; therefore, we still use speed up to indicate the acceleration. The design of SYMV is illustrated in Fig. 1. For TRMV, we can load only half of the input matrix. For GBMV, the speed up depends on the sparsity of the input matrix. We assume 2^12 diagonals in a 2^15 × 2^15 matrix. Every diagonal is stored as a row, and thus we get a 2^12 × 2^15 matrix in band storage, which results in at most 2^15 / 2^12 = 8× speed up. GER takes two input vectors and performs an outer product, which has an operational intensity of only 1/4 and no data reuse. The input vectors have relatively low memory bandwidth usage, because the output is a matrix that dominates the memory accesses.

FBLAS has realized GEMV and GER, and reported > 100 GFlops throughput by generating the inputs on the device. We argue that it is impractical to assume the data are readily available on the device; in most cases, the data need to be loaded from DRAM. A larger PE array cannot improve performance, since these routines are bounded by memory.

TABLE IV
LEVEL 2 ROUTINES (SINGLE PRECISION)

       Device  MHz  GOPs  LUTs  DSPs  BRAMs  Efficiency  Speed up
GEMV   A10     282    16   20%    2%   19%       93%        -
       S10     302    36   27%    1%    8%       94%        -
SYMV   A10     267    15   39%    4%   43%       90%      1.79×
       S10     247    30   56%    2%   50%       79%      1.58×
TRMV   A10     254    15   23%    2%   22%       87%      1.75×
       S10     267    33   31%    1%    9%       86%      1.71×
GBMV   A10     277    16   21%    2%   25%       92%      7.35×
       S10     292    35   27%    1%    9%       91%      7.32×
GER    A10     259   7.6   20%    1%   21%       89%        -
       S10     343    15   27%    1%    8%       78%        -

TABLE V
A LEVEL 1 ROUTINE (ON A10)

     Data Type  MHz  GOPs  LUTs  DSPs  BRAMs  Efficiency
DOT  S          308     8   17%    1%   15%       93%
     D          283     4   27%    2%   19%       96%
     C          323    16   17%    2%   15%       94%
     Z          248   7.5   37%    4%   23%       88%

C. Level 1 routines

Level 1 routines are mostly auxiliary operations, like copying a vector. We report the results of DOT in Table V, with all four data types: single precision real (S), double precision real (D), single precision complex (C), and double precision complex (Z).

DOT is also memory-bound, and we achieved nearly peak efficiency (defined the same way as for the level 2 routines). The operational intensity differs with the data type. One multiplication of two complex numbers needs 4 multiplications and 2 additions of real numbers, and one addition of two complex numbers needs 2 additions of real numbers. Therefore, the numbers of operations per multiply-add for the S:D:C:Z types are 2:2:8:8, and the bytes of an S:D:C:Z number are 4:8:8:16. Thus the operational intensities of the S:D:C:Z data types follow the ratio 2/4 : 2/8 : 8/8 : 8/16 = 2:1:4:2. The throughputs achieved for the four data types roughly follow this ratio.

VI. CONCLUSION

We proposed Lasa, a programming framework for productively implementing high-performance linear algebra routines. Our programming model is succinct yet expressive, capable of describing various compute patterns and memory systems. Our compiler performs extensive optimizations on both compute and I/O. Using this framework, we have developed key BLAS routines and reported impressive performance on routines across all three levels.

ACKNOWLEDGMENT

This work is supported in part by the National Natural Science Foundation of China (NSFC) under grants No. U21B2017 and 62272434. We appreciate the support of Christopher J. Hughes, Pradeep Dubey, Piotr Ratuszniak, Geoff Lowney, and John C. Kreatsoulas from Intel.
REFERENCES

[1] Q. Xiao, S. Zheng, B. Wu, P. Xu, X. Qian, and Y. Liang, "HASCO: Towards agile hardware and software co-design for tensor computation," in Proceedings of the 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 1055–1068.
[2] S. Zheng, R. Chen, A. Wei, Y. Jin, Q. Han, L. Lu, B. Wu, X. Li, S. Yan, and Y. Liang, "AMOS: Enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction," in Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA), 2022, pp. 874–887.
[3] Q. Xiao and Y. Liang, "Towards agile DNN accelerator design using incremental synthesis on FPGAs," in Proceedings of the 30th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2022, pp. 42–48.
[4] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong, "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs," in Proceedings of the 54th Annual Design Automation Conference (DAC), 2017, pp. 1–6.
[5] S. Hammarling, J. Dongarra, J. Du Croz, and R. Hanson, “An extended
set of fortran basic linear algebra subprograms,” ACM Transactions on
Mathematical Software, vol. 14, no. 1, pp. 1–32, 1988.
[6] Intel, “oneAPI Math Kernel Library,” 2022, https://www.intel.com/
content/www/us/en/developer/tools/oneapi/onemkl.html.
[7] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou,
H. Ltaief, P. Luszczek, and S. Tomov, “Numerical linear algebra on
emerging architectures: The plasma and magma projects,” in Journal of
Physics: Conference Series, vol. 180, no. 1, 2009, pp. 12–37.
[8] Nvidia, “Basic Linear Algebra on GPUs,” 2022, https://developer.nvidia.
com/cublas.
[9] Y.-H. Lai, E. Ustun, S. Xiang, Z. Fang, H. Rong, and Z. Zhang, “Pro-
gramming and synthesis for software-defined fpga acceleration: status
and future prospects,” ACM Transactions on Reconfigurable Technology
and Systems (TRETS), vol. 14, no. 4, pp. 1–39, 2021.
[10] Q. Xiao, L. Lu, J. Xie, and Y. Liang, “FCNNLib: An efficient and
flexible convolution algorithm library on fpgas,” in Proceedings of the
57th Annual Design Automation Conference (DAC), 2020, pp. 1–6.
[11] Xilinx, “Vitis BLAS Library,” 2022, https://github.com/Xilinx/Vitis
Libraries/tree/master/blas.
[12] T. De Matteis, J. de Fine Licht, and T. Hoefler, “FBLAS: Streaming
linear algebra on fpga,” in Proceedings of the International Conference
for High Performance Computing, Networking, Storage and Analysis
(SC), 2020, pp. 1–13.
[13] Y.-H. Lai, H. Rong, S. Zheng, W. Zhang, X. Cui, Y. Jia, J. Wang,
B. Sullivan, Z. Zhang, Y. Liang et al., “Susy: A programming model for
productive construction of high-performance systolic arrays on fpgas,”
in Proceedings of the 39th International Conference on Computer-Aided
Design (ICCAD), 2020, pp. 1–9.
[14] Y.-H. Lai, Y. Chi, Y. Hu, J. Wang, C. H. Yu, Y. Zhou, J. Cong, and
Z. Zhang, “HeteroCL: A multi-paradigm programming infrastructure for
software-defined reconfigurable computing,” in Proceedings of the 27th
ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (FPGA), 2019, pp. 242–251.
[15] P. Quinton, “Automatic synthesis of systolic arrays from uniform recur-
rent equations,” ACM SIGARCH Computer architecture news, vol. 12,
no. 3, pp. 208–214, 1984.
[16] L. Lu, N. Guan, Y. Wang, L. Jia, Z. Luo, J. Yin, J. Cong, and Y. Liang,
“TENET: A framework for modeling tensor dataflow based on relation-
centric notation,” in Proceedings of the 48th Annual International
Symposium on Computer Architecture (ISCA), 2021, pp. 720–733.
[17] L. Jia, Z. Luo, L. Lu, and Y. Liang, “Tensorlib: A spatial accelerator
generation framework for tensor algebra,” in Proceedings of the 58th
Annual Design Automation Conference (DAC), 2021, pp. 865–870.
[18] L. Jia, Y. Wang, J. Leng, and Y. Liang, “EMS: efficient memory
subsystem synthesis for spatial accelerators,” in Proceedings of the 59th
Annual Design Automation Conference (DAC), 2022, pp. 67–72.
[19] S. Xiang, Y.-H. Lai, Y. Zhou, H. Chen, N. Zhang, D. Pal, and Z. Zhang,
“HeteroFlow: An accelerator programming model with decoupled data
placement for software-defined FPGAs,” in Proceedings of the 30th
ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (FPGA), 2022, pp. 78–88.
[20] N. Srivastava, H. Rong, P. Barua, G. Feng, H. Cao, Z. Zhang, D. Albonesi, V. Sarkar, W. Chen, P. Petersen et al., "T2S-Tensor: Productively generating high-performance spatial hardware for dense tensor computations," in Proceedings of the 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2019, pp. 181–189.
[21] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," ACM SIGPLAN Notices, vol. 48, no. 6, pp. 519–530, 2013.
[22] B. Hanounik, "Diagonal registers: Novel vector register file design for high performance and multimedia computing," Master's thesis, Citeseer, 2000.
[23] Intel, "DevCloud," 2022, https://devcloud.intel.com.
[24] Intel, "Intel FPGA SDK for OpenCL Software Technology," 2022, https://www.intel.com/content/www/us/en/software/programmable/sdk-for-opencl/overview.html.
[25] J. Wang, L. Guo, and J. Cong, "AutoSA: A polyhedral compiler for high-performance systolic arrays on FPGA," in Proceedings of the 29th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2021, pp. 93–104.